Abstract

It has been recently shown that, under the margin (or low noise) assumption,
there exist classifiers attaining fast rates of convergence of the excess
Bayes risk, i.e., rates faster than $n^{-1/2}$. The works on this subject
suggested the following two conjectures: (i) the best achievable fast rate is of
the order $n^{-1}$, and (ii) the plug-in classifiers generally converge slower than
the classifiers based on empirical risk minimization. We show that both
conjectures are not correct. In particular, we construct plug-in classifiers that
can achieve not only the fast, but also the super-fast rates, i.e., rates faster
than $n^{-1}$. We establish minimax lower bounds showing that the obtained
rates cannot be improved.
1 Introduction

Let $(X, Y)$ be a random couple taking values in $Z = \mathbb{R}^d \times \{0, 1\}$ with joint
distribution $P$. We regard $X \in \mathbb{R}^d$ as a vector of features corresponding to an
object and $Y \in \{0,1\}$ as a label indicating that the object belongs to one of the
two classes. Consider the sample $(X_1, Y_1), \dots, (X_n, Y_n)$, where $(X_j, Y_j)$ are
independent copies of $(X, Y)$. We denote by $P^{\otimes n}$ the product probability measure
according to which the sample is distributed, and by $P_X$ the marginal distribution of $X$.

The goal of a classification procedure is to predict the label $Y$ given the value
of $X$, i.e., to provide a decision rule $f : \mathbb{R}^d \to \{0, 1\}$ which belongs to the set $\mathcal{F}$
of all Borel functions defined on $\mathbb{R}^d$ and taking values in $\{0, 1\}$. The performance
of a decision rule $f$ is measured by the misclassification error

$$R(f) \triangleq P(Y \neq f(X)).$$

The Bayes decision rule is a minimizer of the risk $R(f)$ over all the decision rules
$f \in \mathcal{F}$, and one such minimizer has the form $f^*(X) = \mathbf{1}_{\{\eta(X) \ge 1/2\}}$, where $\mathbf{1}_{\{\cdot\}}$
denotes the indicator function and $\eta(X) \triangleq P(Y = 1 \mid X)$ is the regression function
of $Y$ on $X$ (here $P(dY \mid X)$ is a regular conditional probability which we will use in
the following without further mention).
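The definitions above can be illustrated on a toy example. The sketch below is not part of the original argument: it assumes a hypothetical feature space with three points, with made-up values of the marginal probabilities and of $\eta$, and checks numerically that $R(f) - R(f^*)$ equals $\mathbf{E}\big(|2\eta(X)-1|\,\mathbf{1}_{\{f(X) \neq f^*(X)\}}\big)$, the identity used later in the proof of Theorem 3.1.

```python
# Toy illustration with hypothetical numbers: X takes three values with
# marginal probabilities px, and eta[x] = P(Y = 1 | X = x).
px = {"a": 0.5, "b": 0.3, "c": 0.2}
eta = {"a": 0.9, "b": 0.4, "c": 0.5}


def risk(rule):
    """Misclassification error R(f) = P(Y != f(X)) for a rule given as a dict."""
    # If f(x) = 1 the error probability at x is 1 - eta(x); otherwise it is eta(x).
    return sum(px[x] * ((1 - eta[x]) if rule[x] == 1 else eta[x]) for x in px)


def bayes_rule():
    """Bayes rule f*(x) = 1 if eta(x) >= 1/2, else 0."""
    return {x: int(eta[x] >= 0.5) for x in px}


def excess_risk(rule):
    """R(f) - R(f*), computed directly from the two risks."""
    return risk(rule) - risk(bayes_rule())


def excess_risk_formula(rule):
    """The same quantity via E(|2 eta(X) - 1| 1{f(X) != f*(X)})."""
    star = bayes_rule()
    return sum(px[x] * abs(2 * eta[x] - 1) for x in px if rule[x] != star[x])
```

For the rule that always predicts 0, both functions return 0.4 up to rounding: the rule disagrees with $f^*$ at the points "a" and "c", but the disagreement at "c" costs nothing because $\eta = 1/2$ there.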

An empirical decision rule (a classifier) is a random mapping $\hat f_n : Z^n \to \mathcal{F}$
measurable w.r.t. the sample. Its accuracy can be characterized by the excess risk

$$\mathcal{E}(\hat f_n) \triangleq \mathbf{E} R(\hat f_n) - R(f^*),$$

where $\mathbf{E}$ is the sign of expectation. A key problem in classification is to construct
classifiers with small excess risk for sufficiently large $n$ [cf. Devroye, Györfi and
Lugosi (1996), Vapnik (1998)]. Optimal classifiers can be defined as those having
the best possible rate of convergence of $\mathcal{E}(\hat f_n)$ to 0, as $n \to \infty$. Of course, this rate,
and thus the optimal classifier, depend on the assumptions on the joint distribution
of $(X,Y)$. A standard way to define optimal classifiers is to introduce a class of
joint distributions of $(X, Y)$ and to declare $\hat f_n$ optimal if it achieves the best rate of
convergence in a minimax sense on this class.

Two types of assumptions on the joint distribution of $(X, Y)$ are commonly used:
complexity assumptions and margin assumptions.

Complexity assumptions are stated in two possible ways. The first of them is to
suppose that the regression function $\eta$ is smooth enough or, more generally, belongs
to a class of functions $\Sigma$ having a suitably bounded $\varepsilon$-entropy. This is called a
complexity assumption on the regression function (CAR). Most commonly it is of
the following form.

Assumption (CAR). The regression function $\eta$ belongs to a class $\Sigma$ of functions
on $\mathbb{R}^d$ such that

$$\mathcal{H}(\varepsilon, \Sigma, L_p) \le A_* \varepsilon^{-\rho}, \quad \forall \varepsilon > 0,$$

with some constants $\rho > 0$, $A_* > 0$. Here $\mathcal{H}(\varepsilon, \Sigma, L_p)$ denotes the $\varepsilon$-entropy of the
set $\Sigma$ w.r.t. an $L_p$ norm with some $1 \le p \le \infty$.

At this stage of discussion we do not identify precisely the value of $p$ for the
$L_p$ norm in Assumption (CAR), nor the measure with respect to which this norm
is defined. Examples will be given later. If $\Sigma$ is a class of smooth functions with
smoothness parameter $\beta$ on a compact in $\mathbb{R}^d$, for example, a Hölder class, as
described below, a typical value of $\rho$ in Assumption (CAR) is $\rho = d/\beta$.

Assumption (CAR) is well adapted for the study of plug-in rules, i.e., of the
classifiers having the form

$$\hat f_n^{PI}(X) = \mathbf{1}_{\{\hat\eta_n(X) \ge 1/2\}} \tag{1.1}$$

where $\hat\eta_n$ is a nonparametric estimator of the function $\eta$. Indeed, Assumption (CAR)
typically reads as a smoothness assumption on $\eta$ implying that a good nonparametric
estimator (kernel, local polynomial, orthogonal series or other) $\hat\eta_n$ converges with
some rate to the regression function $\eta$, as $n \to \infty$. In turn, closeness of $\hat\eta_n$ to $\eta$
implies closeness of $\hat f_n$ to $f^*$: for any plug-in classifier $\hat f_n^{PI}$ we have

$$\mathbf{E} R(\hat f_n^{PI}) - R(f^*) \le 2\, \mathbf{E} \int |\hat\eta_n(x) - \eta(x)|\, P_X(dx) \tag{1.2}$$

(cf. Devroye, Györfi and Lugosi (1996), Theorem 2.2). For various types of
estimators $\hat\eta_n$ and under rather general assumptions it can be shown that, if (CAR) holds,
the RHS of (1.2) is uniformly of the order $n^{-1/(2+\rho)}$, and thus

$$\sup_{P : (CAR)} \mathcal{E}(\hat f_n^{PI}) = O\big(n^{-1/(2+\rho)}\big), \quad n \to \infty, \tag{1.3}$$

[cf. Yang (1999)]. In particular, if $\rho = d/\beta$ (which corresponds to a class of smooth
functions with smoothness parameter $\beta$), we get

$$\sup_{P : (CAR)} \mathcal{E}(\hat f_n^{PI}) = O\big(n^{-\beta/(2\beta+d)}\big), \quad n \to \infty. \tag{1.4}$$

Note that (1.4) can be easily deduced from (1.2) and standard results on the $L_1$ or
$L_2$ convergence rates of usual nonparametric regression estimators on $\beta$-smoothness
classes $\Sigma$. The rates in (1.3), (1.4) are quite slow, always slower than $n^{-1/2}$. In (1.4)
they deteriorate dramatically as the dimension $d$ increases. Moreover, Yang (1999)
shows that, under general assumptions, the bound (1.4) cannot be improved in a
minimax sense. These results raised some pessimism about the plug-in rules.
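To make the dependence on the dimension concrete, the following minimal sketch evaluates the exponent $\beta/(2\beta+d)$ of the bound (1.4). The function name and the sample values of $\beta$ and $d$ are illustrative choices, not taken from the paper.

```python
def plugin_car_exponent(beta, d):
    """Exponent r such that the bound (1.4) reads O(n^{-r}): r = beta / (2*beta + d)."""
    if beta <= 0 or d < 1:
        raise ValueError("need beta > 0 and d >= 1")
    return beta / (2 * beta + d)


def exponents_by_dimension(beta, dims):
    """Map each dimension in dims to the exponent of (1.4) for smoothness beta."""
    return {d: plugin_car_exponent(beta, d) for d in dims}
```

For $\beta = 2$, the exponent is $2/5$ when $d = 1$ and $2/104$ when $d = 100$; it always stays below $1/2$, in line with the remark that these rates are slower than $n^{-1/2}$.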

The second way to describe complexity is to introduce a structure on the class
of possible decision sets $G^* = \{x : f^*(x) = 1\} = \{x : \eta(x) \ge 1/2\}$ rather than on
that of regression functions $\eta$. A standard complexity assumption on the decision
set (CAD) is the following.

Assumption (CAD). The decision set $G^*$ belongs to a class $\mathcal{G}$ of subsets of $\mathbb{R}^d$
such that

$$\mathcal{H}(\varepsilon, \mathcal{G}, d_\triangle) \le A_* \varepsilon^{-\rho}, \quad \forall \varepsilon > 0,$$

with some constants $\rho > 0$, $A_* > 0$. Here $\mathcal{H}(\varepsilon, \mathcal{G}, d_\triangle)$ denotes the $\varepsilon$-entropy of
the class $\mathcal{G}$ w.r.t. the measure of symmetric difference pseudo-distance between sets
defined by $d_\triangle(G, G') = P_X(G \triangle G')$ for two measurable subsets $G$ and $G'$ in $\mathbb{R}^d$.

The parameter $\rho$ in Assumption (CAD) typically characterizes the smoothness
of the boundary of $G^*$ [cf. Tsybakov (2004a)]. Note that, in general, there is no
connection between Assumptions (CAR) and (CAD). Indeed, the fact that $G^*$ has
a smooth boundary does not imply that $\eta$ is smooth, and vice versa. The values of
$\rho$ closer to 0 correspond to smoother boundaries (less complex sets $G^*$). As a limit
case when $\rho \to 0$ one can consider the Vapnik-Chervonenkis classes (VC-classes) for
which the $\varepsilon$-entropy is logarithmic in $1/\varepsilon$.

Assumption (CAD) is suited for the study of empirical risk minimization (ERM)
type classifiers introduced by Vapnik and Chervonenkis (1974), see also Devroye,
Györfi and Lugosi (1996), Vapnik (1998). As shown in Tsybakov (2004a), for every
$0 < \rho < 1$ there exist ERM classifiers $\hat f_n^{ERM}$ such that, under Assumption (CAD),

$$\sup_{P : (CAD)} \mathcal{E}(\hat f_n^{ERM}) = O\big(n^{-1/2}\big), \quad n \to \infty. \tag{1.5}$$

The rate of convergence in (1.5) is better than that for plug-in rules, cf. (1.3)-(1.4),
and it does not depend on $\rho$ (respectively, on the dimension $d$). Note that the
comparison between (1.5) and (1.3)-(1.4) is not quite legitimate, because there is no
inclusion between classes of joint distributions $P$ of $(X, Y)$ satisfying Assumptions
(CAR) and (CAD). Nevertheless, such a comparison has often been interpreted as
an argument against the plug-in rules. Indeed, Yang's lower bound shows that
the $n^{-1/2}$ rate cannot be attained under Assumption (CAR) suited for the plug-in
rules. Recently, advantages of the ERM type classifiers, including penalized ERM
methods, have been further confirmed by the fact that, under the margin (or low
noise) assumption, they can attain fast rates of convergence, i.e., rates that are
faster than $n^{-1/2}$ [Mammen and Tsybakov (1999), Tsybakov (2004a), Massart and
Nédélec (2003), Tsybakov and van de Geer (2005), Koltchinskii (2005), Audibert
(2004)].

The margin assumption (or low noise assumption) is stated as follows.

Assumption (MA). There exist constants $C_0 > 0$ and $\alpha \ge 0$ such that

$$P_X\big(0 < |\eta(X) - 1/2| \le t\big) \le C_0 t^{\alpha}, \quad \forall\, t > 0. \tag{1.6}$$
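As an illustration of Assumption (MA), consider a hypothetical one-dimensional example that is not taken from the paper: $X$ uniform on $[0,1]$ and $\eta(x) = 1/2 + \tfrac12\,\mathrm{sign}(2x-1)\,|2x-1|^{1/\alpha}$. Here $|\eta(x) - 1/2| \le t$ is equivalent to $|2x-1| \le (2t)^{\alpha}$, so the probability in (1.6) equals $\min(1, (2t)^{\alpha})$ and (MA) holds with $C_0 = 2^{\alpha}$. The sketch below checks this numerically with a midpoint grid.

```python
def eta(x, alpha):
    """Hypothetical regression function on [0, 1] with margin exponent alpha."""
    u = 2.0 * x - 1.0
    sign = (u > 0) - (u < 0)
    return 0.5 + 0.5 * sign * abs(u) ** (1.0 / alpha)


def margin_mass(t, alpha, grid=100_000):
    """Approximate P_X(0 < |eta(X) - 1/2| <= t) for X uniform on [0, 1]."""
    hits = 0
    for i in range(grid):
        x = (i + 0.5) / grid  # midpoints of a regular grid on [0, 1]
        gap = abs(eta(x, alpha) - 0.5)
        if 0.0 < gap <= t:
            hits += 1
    return hits / grid
```

Larger $\alpha$ makes $\eta$ leave the level $1/2$ more abruptly, so less probability mass sits near the decision boundary.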

The case $\alpha = 0$ is trivial (no assumption) and is included for notational
convenience. Assumption (MA) provides a useful characterization of the behavior of the
regression function $\eta$ in a vicinity of the level $\eta = 1/2$ which turns out to be crucial
for convergence of classifiers (for more discussion of the margin assumption see
Tsybakov (2004a)). The main point is that, under (MA), fast classification rates up to
$n^{-1}$ are achievable. In particular, for every $0 < \rho < 1$ and $\alpha > 0$ there exist ERM
type classifiers $\hat f_n^{ERM}$ such that

$$\sup_{P : (CAD), (MA)} \mathcal{E}(\hat f_n^{ERM}) = O\Big(n^{-\frac{1+\alpha}{2+\alpha+\alpha\rho}}\Big), \quad n \to \infty, \tag{1.7}$$

where $\sup_{P : (CAD), (MA)}$ denotes the supremum over all joint distributions $P$ of $(X, Y)$
satisfying Assumptions (CAD) and (MA). The RHS of (1.7) can be arbitrarily close
to $O(n^{-1})$ for large $\alpha$ and small $\rho$. Result (1.7) for direct ERM classifiers on $\varepsilon$-nets is
proved by Tsybakov (2004a), and for some other ERM type classifiers by Tsybakov
and van de Geer (2005), Koltchinskii (2005) and Audibert (2004) (in some of these
papers the rate of convergence (1.7) is obtained with an extra log-factor).
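The behavior of the ERM rate can be checked with a short computation. The sketch below assumes that the exponent in (1.7) is $(1+\alpha)/(2+\alpha+\alpha\rho)$; the function name is illustrative.

```python
def erm_margin_exponent(alpha, rho):
    """Exponent r such that the bound (1.7) reads O(n^{-r})."""
    if alpha < 0 or not 0 < rho < 1:
        raise ValueError("need alpha >= 0 and 0 < rho < 1")
    return (1 + alpha) / (2 + alpha + alpha * rho)
```

With $\alpha = 0$ the exponent is $1/2$, which recovers (1.5). With $\alpha = 100$ and $\rho = 0.01$ it is $101/103$, close to but below 1, so rates of this type never become super-fast.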

Comparison of (1.5) and (1.7) with (1.2) and (1.3) seems to confirm the
conjecture that the plug-in classifiers are inferior to the ERM type ones. The main
message of the present paper is to disprove this conjecture. We will show that there
exist plug-in rules that converge with fast rates, and even with super-fast rates, i.e.,
faster than $n^{-1}$ under the margin assumption (MA). The basic idea of the proof is
to use exponential inequalities for the regression estimator $\hat\eta_n$ (see Section 3 below)
or the convergence results in the $L_\infty$ norm (see Section 5), rather than the usual
$L_1$ or $L_2$ norm convergence of $\hat\eta_n$, as previously described (cf. (1.2)). We do not
know whether the super-fast rates are attainable for ERM rules or, more precisely,
under Assumption (CAD) which serves for the study of the ERM type rules. It
is important to note that our results on fast rates cover a more general setting than
just classification with plug-in rules. These are rather results about classification
in the regression complexity context under the margin assumption. In particular, we
establish minimax lower bounds valid for all classifiers, and we construct a "hybrid"
plug-in/ERM procedure (ERM based on a grid on a set of regression functions $\eta$)
that achieves optimality. Thus, the point is mainly not about the type of procedure
(plug-in or ERM) but about the type of complexity assumption (on the regression
function (CAR) or on the decision set (CAD)) that should be natural to impose.
Assumption (CAR) on the regression function arises in a natural way in the
analysis of several practical procedures of plug-in type, such as boosting and SVM [cf.
Blanchard, Lugosi and Vayatis (2003), Bartlett, Jordan and McAuliffe (2003), Scovel
and Steinwart (2003), Blanchard, Bousquet and Massart (2004), Tarigan and van de
Geer (2004)]. These procedures are now intensively studied but, to our knowledge,
only suboptimal rates of convergence have been proved in the regression complexity
context under the margin assumption. The results in Section 4 point out this fact
(see also Section 5), and establish the best achievable rates of classification that
those procedures should expectedly attain.



2 Notation and definitions 

In this section we introduce some notation, definitions and basic facts that will be 
used in the paper. 

We denote by $C, C_1, C_2, \dots$ positive constants whose values may differ from line
to line. The symbols $\mathbf{P}$ and $\mathbf{E}$ stand for generic probability and expectation signs,
and $\mathbf{E}_X$ is the expectation w.r.t. the marginal distribution $P_X$. We denote by $B(x, r)$
the closed Euclidean ball in $\mathbb{R}^d$ centered at $x \in \mathbb{R}^d$ and of radius $r > 0$.

For any multi-index $s = (s_1, \dots, s_d) \in \mathbb{N}^d$ and any $x = (x_1, \dots, x_d) \in \mathbb{R}^d$, we
define $|s| = \sum_{i=1}^d s_i$, $s! = s_1! \cdots s_d!$, $x^s = x_1^{s_1} \cdots x_d^{s_d}$ and $\|x\| = (x_1^2 + \cdots + x_d^2)^{1/2}$.
Let $D^s$ denote the differential operator $D^s = \dfrac{\partial^{s_1 + \cdots + s_d}}{\partial x_1^{s_1} \cdots \partial x_d^{s_d}}$.

Let $\beta > 0$. Denote by $\lfloor\beta\rfloor$ the maximal integer that is strictly less than $\beta$. For
any $x \in \mathbb{R}^d$ and any $\lfloor\beta\rfloor$ times continuously differentiable real valued function $g$ on
$\mathbb{R}^d$, we denote by $g_x$ its Taylor polynomial of degree $\lfloor\beta\rfloor$ at point $x$:

$$g_x(x') \triangleq \sum_{|s| \le \lfloor\beta\rfloor} \frac{(x' - x)^s}{s!}\, D^s g(x).$$

Let $L > 0$. The $(\beta, L, \mathbb{R}^d)$-Hölder class of functions, denoted $\Sigma(\beta, L, \mathbb{R}^d)$, is
defined as the set of functions $g : \mathbb{R}^d \to \mathbb{R}$ that are $\lfloor\beta\rfloor$ times continuously differentiable
and satisfy, for any $x, x' \in \mathbb{R}^d$, the inequality

$$|g(x') - g_x(x')| \le L \|x - x'\|^{\beta}.$$

Fix some constants $c_0, r_0 > 0$. We will say that a Lebesgue measurable set
$A \subseteq \mathbb{R}^d$ is $(c_0, r_0)$-regular if

$$\lambda[A \cap B(x, r)] \ge c_0\, \lambda[B(x, r)], \quad \forall\, 0 < r \le r_0, \ \forall\, x \in A, \tag{2.1}$$

where $\lambda[S]$ stands for the Lebesgue measure of $S \subset \mathbb{R}^d$. To illustrate this definition,
consider the following example. Let $d \ge 2$. Then the set $A = \{x = (x_1, \dots, x_d) \in
\mathbb{R}^d : \sum_{j=1}^d |x_j|^q \le 1\}$ is $(c_0, r_0)$-regular with some $c_0, r_0 > 0$ for $q \ge 1$, and there are
no $c_0, r_0 > 0$ such that $A$ is $(c_0, r_0)$-regular for $0 < q < 1$.

Introduce now two assumptions on the marginal distribution $P_X$ that will be
used in the sequel.

Definition 2.1 Fix $0 < c_0, r_0, \mu_{\max} < \infty$ and a compact $\mathcal{C} \subset \mathbb{R}^d$. We say that the
mild density assumption is satisfied if the marginal distribution $P_X$ is supported
on a compact $(c_0, r_0)$-regular set $A \subseteq \mathcal{C}$ and has a uniformly bounded density $\mu$ w.r.t.
the Lebesgue measure: $\mu(x) \le \mu_{\max}$, $\forall\, x \in A$.

Definition 2.2 Fix some constants $c_0, r_0 > 0$ and $0 < \mu_{\min} < \mu_{\max} < \infty$ and a
compact $\mathcal{C} \subset \mathbb{R}^d$. We say that the strong density assumption is satisfied if the
marginal distribution $P_X$ is supported on a compact $(c_0, r_0)$-regular set $A \subseteq \mathcal{C}$ and
has a density $\mu$ w.r.t. the Lebesgue measure bounded away from zero and infinity on
$A$:

$$\mu_{\min} \le \mu(x) \le \mu_{\max} \ \text{ for } x \in A, \quad \text{and } \mu(x) = 0 \text{ otherwise}.$$

We finally recall some notions related to locally polynomial estimators.

Definition 2.3 For $h > 0$, $x \in \mathbb{R}^d$, for an integer $l \ge 0$ and a function $K : \mathbb{R}^d \to
\mathbb{R}_+$, denote by $\hat\theta_x$ a polynomial on $\mathbb{R}^d$ of degree $l$ which minimizes

$$\sum_{i=1}^n \big[Y_i - \hat\theta_x(X_i - x)\big]^2 K\Big(\frac{X_i - x}{h}\Big). \tag{2.2}$$

The locally polynomial estimator $\hat\eta_n^{LP}(x)$ of order $l$, or LP($l$) estimator, of the
value $\eta(x)$ of the regression function at point $x$ is defined by: $\hat\eta_n^{LP}(x) \triangleq \hat\theta_x(0)$ if
$\hat\theta_x$ is the unique minimizer of (2.2) and $\hat\eta_n^{LP}(x) \triangleq 0$ otherwise. The value $h$ is called
the bandwidth and the function $K$ is called the kernel of the LP($l$) estimator.

Let $T_s$ denote the coefficients of $\hat\theta_x$ indexed by multi-index $s \in \mathbb{N}^d$: $\hat\theta_x(u) =
\sum_{|s| \le l} T_s u^s$. Introduce the vectors $T \triangleq (T_s)_{|s| \le l}$, $V \triangleq (V_s)_{|s| \le l}$, where

$$V_s \triangleq \sum_{i=1}^n Y_i (X_i - x)^s K\Big(\frac{X_i - x}{h}\Big), \tag{2.3}$$

$U(u) \triangleq (u^s)_{|s| \le l}$, and the matrix $Q \triangleq (Q_{s_1, s_2})_{|s_1|, |s_2| \le l}$, where

$$Q_{s_1, s_2} \triangleq \sum_{i=1}^n (X_i - x)^{s_1 + s_2} K\Big(\frac{X_i - x}{h}\Big). \tag{2.4}$$

The following result is straightforward (cf. Section 1.7 in Tsybakov (2004b) where
the case $d = 1$ is considered).

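Proposition 2.1 If the matrix $Q$ is positive definite, there exists a unique
polynomial on $\mathbb{R}^d$ of degree $l$ minimizing (2.2). Its vector of coefficients is given by
$T = Q^{-1}V$ and the corresponding LP($l$) regression function estimator has the form

$$\hat\eta_n^{LP}(x) = U^{\top}(0)\, Q^{-1} V = \sum_{i=1}^n Y_i\, K\Big(\frac{X_i - x}{h}\Big)\, U^{\top}(0)\, Q^{-1}\, U(X_i - x).$$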
3 Fast rates for plug-in rules: the strong density 
assumption 

We first state a general result showing how the rates of convergence of plug-in
classifiers can be deduced from exponential inequalities for the corresponding regression
estimators.

In the sequel, for an estimator $\hat\eta_n$ of $\eta$, we write

$$\mathbf{P}\big(|\hat\eta_n(X) - \eta(X)| \ge \delta\big) \triangleq \int P^{\otimes n}\big(|\hat\eta_n(x) - \eta(x)| \ge \delta\big)\, P_X(dx), \quad \forall\, \delta > 0,$$

i.e., we consider the probability taken with respect to the distribution of the sample
$(X_1, Y_1, \dots, X_n, Y_n)$ and the distribution of the input $X$.

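Theorem 3.1 Let $\hat\eta_n$ be an estimator of the regression function $\eta$ and $\mathcal{P}$ a set of
probability distributions on $Z$ such that for some constants $C_1 > 0$, $C_2 > 0$, for some
positive sequence $a_n$, for $n \ge 1$ and any $\delta > 0$, and for almost all $x$ w.r.t. $P_X$, we
have

$$\sup_{P \in \mathcal{P}} P^{\otimes n}\big(|\hat\eta_n(x) - \eta(x)| \ge \delta\big) \le C_1 \exp\big(-C_2 a_n \delta^2\big). \tag{3.1}$$

Consider the plug-in classifier $\hat f_n = \mathbf{1}_{\{\hat\eta_n \ge 1/2\}}$. If all the distributions $P \in \mathcal{P}$ satisfy
the margin assumption (MA), we have

$$\sup_{P \in \mathcal{P}} \big\{\mathbf{E} R(\hat f_n) - R(f^*)\big\} \le C a_n^{-\frac{1+\alpha}{2}}$$

for $n \ge 1$ with some constant $C > 0$ depending only on $\alpha$, $C_0$, $C_1$ and $C_2$.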
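Proof. Consider the sets $A_j \subset \mathbb{R}^d$, $j = 0, 1, 2, \dots$, defined as

$$A_0 \triangleq \{x \in \mathbb{R}^d : 0 < |\eta(x) - \tfrac12| \le \delta\},$$
$$A_j \triangleq \{x \in \mathbb{R}^d : 2^{j-1}\delta < |\eta(x) - \tfrac12| \le 2^{j}\delta\}, \quad \text{for } j \ge 1.$$

For any $\delta > 0$, we may write

$$\begin{aligned}
\mathbf{E} R(\hat f_n) - R(f^*) &= \mathbf{E}\big(|2\eta(X) - 1|\, \mathbf{1}_{\{\hat f_n(X) \ne f^*(X)\}}\big) \\
&\le 2\delta\, P_X\big(0 < |\eta(X) - \tfrac12| \le \delta\big) \\
&\quad + \sum_{j \ge 1} \mathbf{E}\big(|2\eta(X) - 1|\, \mathbf{1}_{\{\hat f_n(X) \ne f^*(X)\}} \mathbf{1}_{\{X \in A_j\}}\big).
\end{aligned} \tag{3.2}$$

On the event $\{\hat f_n \ne f^*\}$ we have $|\eta - \tfrac12| \le |\hat\eta_n - \eta|$. So, for any $j \ge 1$, we get

$$\begin{aligned}
\mathbf{E}\big(|2\eta(X) - 1|\, &\mathbf{1}_{\{\hat f_n(X) \ne f^*(X)\}} \mathbf{1}_{\{X \in A_j\}}\big) \\
&\le 2^{j+1}\delta\, \mathbf{E}\big[\mathbf{1}_{\{|\hat\eta_n(X) - \eta(X)| \ge 2^{j-1}\delta\}} \mathbf{1}_{\{0 < |\eta(X) - 1/2| \le 2^{j}\delta\}}\big] \\
&\le 2^{j+1}\delta\, \mathbf{E}_X\big[P^{\otimes n}\big(|\hat\eta_n(X) - \eta(X)| \ge 2^{j-1}\delta\big) \mathbf{1}_{\{0 < |\eta(X) - 1/2| \le 2^{j}\delta\}}\big] \\
&\le C_1 2^{j+1}\delta \exp\big(-C_2 a_n (2^{j-1}\delta)^2\big)\, P_X\big(0 < |\eta(X) - \tfrac12| \le 2^{j}\delta\big) \\
&\le 2 C_1 C_0\, 2^{j(1+\alpha)} \delta^{1+\alpha} \exp\big(-C_2 a_n (2^{j-1}\delta)^2\big),
\end{aligned}$$

where in the last inequality we used Assumption (MA). Now, from inequality (3.2),
taking $\delta = a_n^{-1/2}$ and using Assumption (MA) to bound the first term of the right
hand side of (3.2), we get

$$\mathbf{E} R(\hat f_n) - R(f^*) \le 2 C_0\, a_n^{-\frac{1+\alpha}{2}} + 2 C_1 C_0\, a_n^{-\frac{1+\alpha}{2}} \sum_{j \ge 1} 2^{j(1+\alpha)} \exp\big(-C_2 2^{2j-2}\big) \le C a_n^{-\frac{1+\alpha}{2}},$$

since the series converges.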



Inequality (3.1) is crucial to obtain the above result. This inequality holds true
for various types of estimators and various sets of probability distributions $\mathcal{P}$. Here
we focus on a standard case where $\eta$ belongs to the Hölder class $\Sigma(\beta, L, \mathbb{R}^d)$ and the
marginal law of $X$ satisfies the strong density assumption. We are going to show
that in this case there exist estimators satisfying inequality (3.1) with $a_n = n^{\frac{2\beta}{2\beta+d}}$.
These can be, for example, locally polynomial estimators. Specifically, assume from
now on that $K$ is a kernel satisfying

$$\exists\, c > 0 : \ K(x) \ge c\, \mathbf{1}_{\{\|x\| \le c\}}, \quad \forall\, x \in \mathbb{R}^d, \tag{3.3}$$
$$\int_{\mathbb{R}^d} K(u)\, du = 1, \tag{3.4}$$
$$\int_{\mathbb{R}^d} \big(1 + \|u\|^{4\beta}\big) K^2(u)\, du < \infty, \tag{3.5}$$
$$\sup_{u \in \mathbb{R}^d} \big(1 + \|u\|^{2\beta}\big) K(u) < \infty. \tag{3.6}$$

Let $h > 0$, and consider the matrix $\bar B \triangleq (\bar B_{s_1, s_2})_{|s_1|, |s_2| \le \lfloor\beta\rfloor}$, where

$$\bar B_{s_1, s_2} \triangleq \frac{1}{n h^d} \sum_{i=1}^n \Big(\frac{X_i - x}{h}\Big)^{s_1 + s_2} K\Big(\frac{X_i - x}{h}\Big).$$

Define the regression function estimator $\hat\eta_n^*$ as follows. If the smallest eigenvalue
of the matrix $\bar B$ is greater than $(\log n)^{-1}$ we set $\hat\eta_n^*(x)$ equal to the projection of
$\hat\eta_n^{LP}(x)$ on the interval $[0, 1]$, where $\hat\eta_n^{LP}(x)$ is the LP($\lfloor\beta\rfloor$) estimator with a bandwidth
$h > 0$ and a kernel $K$ satisfying (3.3)-(3.6). If the smallest eigenvalue of $\bar B$ is less
than $(\log n)^{-1}$ we set $\hat\eta_n^*(x) = 0$.

Theorem 3.2 Let $\mathcal{P}$ be a class of probability distributions $P$ on $Z$ such that the
regression function $\eta$ belongs to the Hölder class $\Sigma(\beta, L, \mathbb{R}^d)$ and the marginal law of
$X$ satisfies the strong density assumption. Then there exist constants $C_1, C_2, C_3 > 0$
such that for any $0 < h \le r_0/c$, any $C_3 h^{\beta} < \delta$ and any $n \ge 1$ the estimator $\hat\eta_n^*$
satisfies

$$\sup_{P \in \mathcal{P}} P^{\otimes n}\big(|\hat\eta_n^*(x) - \eta(x)| \ge \delta\big) \le C_1 \exp\big(-C_2 n h^d \delta^2\big) \tag{3.7}$$

for almost all $x$ w.r.t. $P_X$. As a consequence, there exist $C_1, C_2 > 0$ such that for
$h = n^{-\frac{1}{2\beta+d}}$ and any $\delta > 0$, $n \ge 1$ we have

$$\sup_{P \in \mathcal{P}} P^{\otimes n}\big(|\hat\eta_n^*(x) - \eta(x)| \ge \delta\big) \le C_1 \exp\big(-C_2 n^{\frac{2\beta}{2\beta+d}} \delta^2\big) \tag{3.8}$$

for almost all $x$ w.r.t. $P_X$. The constants $C_1, C_2, C_3$ depend only on $\beta$, $d$, $L$, $c_0$, $r_0$,
$\mu_{\min}$, $\mu_{\max}$, and on the kernel $K$.

Proof. See Section IHTTl ■ 

Remark 3.1 We have chosen here the LP estimators of $\eta$ because for them the
exponential inequality (3.1) holds without additional smoothness conditions on the
marginal density of $X$. For other popular regression estimators, such as kernel or
orthogonal series ones, a similar inequality can also be proved if we assume that the
marginal density of $X$ is as smooth as the regression function.

Definition 3.1 For a fixed parameter $\alpha \ge 0$, fixed positive parameters
$c_0, r_0, C_0, \beta, L, \mu_{\max} > \mu_{\min} > 0$ and a fixed compact $\mathcal{C} \subset \mathbb{R}^d$, let $\mathcal{P}_\Sigma$ denote the
class of all probability distributions $P$ on $Z$ such that

(i) the margin assumption (MA) is satisfied,

(ii) the regression function $\eta$ belongs to the Hölder class $\Sigma(\beta, L, \mathbb{R}^d)$,

(iii) the strong density assumption on $P_X$ is satisfied.

Theorem 3.1 and (3.8) immediately imply the next result.

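Theorem 3.3 For any $n \ge 1$ the excess risk of the plug-in classifier $\hat f_n^* = \mathbf{1}_{\{\hat\eta_n^* \ge 1/2\}}$
with bandwidth $h = n^{-\frac{1}{2\beta+d}}$ satisfies

$$\sup_{P \in \mathcal{P}_\Sigma} \big\{\mathbf{E} R(\hat f_n^*) - R(f^*)\big\} \le C n^{-\frac{\beta(1+\alpha)}{2\beta+d}},$$

where the constant $C > 0$ depends only on $\alpha$, $C_0$, $C_1$ and $C_2$.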
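The exponent in this bound follows from Theorem 3.1 with $a_n = n^{2\beta/(2\beta+d)}$, which gives $a_n^{-(1+\alpha)/2} = n^{-\beta(1+\alpha)/(2\beta+d)}$. The sketch below, with illustrative function names, evaluates this exponent in exact arithmetic and classifies the resulting rate. Solving $r > 1/2$ gives $\alpha\beta > d/2$, and solving $r > 1$ gives $(\alpha - 1)\beta > d$.

```python
from fractions import Fraction


def plugin_margin_exponent(alpha, beta, d):
    """Exponent r = beta * (1 + alpha) / (2 * beta + d) of the rate n^{-r}."""
    a, b = Fraction(alpha), Fraction(beta)
    return b * (1 + a) / (2 * b + d)


def regime(alpha, beta, d):
    """Classify the rate n^{-r}: slow (r <= 1/2), fast (r > 1/2), super-fast (r > 1)."""
    r = plugin_margin_exponent(alpha, beta, d)
    if r > 1:
        return "super-fast"
    if r > Fraction(1, 2):
        return "fast"
    return "slow"
```

For instance, $(\alpha, \beta, d) = (2, 1, 2)$ gives the exponent $3/4$, a fast rate, while $(3, 2, 1)$ gives $8/5$, a super-fast rate.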

For $\alpha\beta > d/2$ the convergence rate $n^{-\frac{\beta(1+\alpha)}{2\beta+d}}$ obtained in Theorem 3.3 is a fast
rate, i.e., it is faster than $n^{-1/2}$. Furthermore, it is a super-fast rate (i.e., is faster
than $n^{-1}$) for $(\alpha - 1)\beta > d$, which is the condition $\beta(1+\alpha) > 2\beta + d$ rewritten.
We must note that if this condition is satisfied, the class
$\mathcal{P}_\Sigma$ is rather poor, and thus super-fast rates can occur only for very particular joint
distributions of $(X, Y)$. Intuitively, this is clear. Indeed, to have a very smooth
regression function $\eta$ (i.e., very large $\beta$) implies that when $\eta$ hits the level 1/2, it
cannot "take off" from this level too abruptly. As a consequence, when the density
of the distribution $P_X$ is bounded away from 0 at a vicinity of the hitting point,
the margin assumption cannot be satisfied for large $\alpha$ since this assumption puts an
upper bound on the "time spent" by the regression function near 1/2. So, $\alpha$ and $\beta$
cannot be simultaneously very large. It can be shown that the cases of "too large"
and "not too large" $(\alpha, \beta)$ are essentially described by the conditions $\alpha\beta > d$ and
$\alpha\beta \le d$.

To be more precise, observe first that $\mathcal{P}_\Sigma$ is not empty for $\alpha\beta > d$, so that the
super-fast rates can effectively occur. Examples of laws $P \in \mathcal{P}_\Sigma$ under this condition
can be easily given, such as the one with $P_X$ equal to the uniform distribution on a
ball centered at 0 in $\mathbb{R}^d$, and the regression function defined by $\eta(x) = 1/2 - C\|x\|^2$
with an appropriate $C > 0$. Clearly, $\eta$ belongs to Hölder classes with arbitrarily
large $\beta$ and Assumption (MA) is satisfied with $\alpha = d/2$. Thus, for $d \ge 3$ and $\beta$
large enough super-fast rates can occur. Note that in this example the decision set
$\{x : \eta(x) \ge 1/2\}$ has the Lebesgue measure 0 in $\mathbb{R}^d$. It turns out that this condition
is necessary to achieve classification with super-fast rates when the Hölder classes
of regression functions are considered.
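The margin exponent in this example can be verified directly. The derivation below is a worked sketch under an added assumption: the radius $R$ of the ball is a symbol introduced here for illustration, and $C R^2 \le 1/2$ is imposed so that $\eta$ takes values in $[0, 1]$ on the support.

```latex
% Assumption for illustration: P_X is uniform on B(0, R) and C R^2 <= 1/2.
\begin{align*}
P_X\big(0 < |\eta(X) - 1/2| \le t\big)
  &= P_X\big(0 < C\|X\|^2 \le t\big)
   = P_X\big(\|X\| \le (t/C)^{1/2}\big) \\
  &= \Big(\frac{\min\{R, (t/C)^{1/2}\}}{R}\Big)^{d}
   \le (C R^2)^{-d/2}\, t^{d/2},
\end{align*}
% so (MA) holds with alpha = d/2 and C_0 = (C R^2)^{-d/2}.
% Plugging alpha = d/2 into the exponent of Theorem 3.3:
\begin{align*}
\frac{\beta(1 + d/2)}{2\beta + d} > 1
  \iff \beta(d + 2) > 4\beta + 2d
  \iff \beta(d - 2) > 2d .
\end{align*}
```

The last inequality can hold only if $d \ge 3$, and then it holds for every $\beta > 2d/(d-2)$, which matches the statement that super-fast rates occur for $d \ge 3$ and $\beta$ large enough.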

To explain this and to have further insight into the problem of super-fast rates,
consider the following two questions:

• for which parameters $\alpha$, $\beta$ and $d$ is there a distribution $P \in \mathcal{P}_\Sigma$ such that the
regression function associated with $P$ hits$^1$ 1/2 in the support of $P_X$?

• for which parameters $\alpha$, $\beta$ and $d$ is there a distribution $P \in \mathcal{P}_\Sigma$ such that
the regression function associated with $P$ crosses$^2$ 1/2 in the interior of the
support of $P_X$?

The following result gives a precise description of the constraints on $(\alpha, \beta)$ leading
to possibility or impossibility of the super-fast rates.

Proposition 3.4

• If $\alpha(1 \wedge \beta) > d$, there is no distribution $P \in \mathcal{P}_\Sigma$ such that the
regression function $\eta$ associated with $P$ hits 1/2 in the interior of the support
of $P_X$.

• For any $\alpha, \beta > 0$ and integer $d \ge \alpha(1 \wedge \beta)$, any positive parameter $L$ and any
compact $\mathcal{C} \subset \mathbb{R}^d$ with non-empty interior, for appropriate positive parameters
$C_0, c_0, r_0, \mu_{\max} > \mu_{\min} > 0$, there are distributions $P \in \mathcal{P}_\Sigma$ such that the
regression function $\eta$ associated with $P$ hits 1/2 in the boundary of the support
of $P_X$.

• For any $\alpha, \beta > 0$, any integer $d \ge 2\alpha$, any positive parameter $L$ and any
compact $\mathcal{C} \subset \mathbb{R}^d$ with non-empty interior, for appropriate positive parameters
$C_0, c_0, r_0, \mu_{\max} > \mu_{\min} > 0$, there are distributions $P \in \mathcal{P}_\Sigma$ such that the
regression function $\eta$ associated with $P$ hits 1/2 in the interior of the support
of $P_X$.

• If $\alpha(1 \wedge \beta) > 1$ there is no distribution $P \in \mathcal{P}_\Sigma$ such that the regression
function $\eta$ associated with $P$ crosses 1/2 in the interior of the support of $P_X$.
Conversely, for any $\alpha, \beta > 0$ such that $\alpha(1 \wedge \beta) \le 1$, any integer $d$, any positive
parameter $L$ and any compact $\mathcal{C} \subset \mathbb{R}^d$ with non-empty interior, for appropriate
positive parameters $C_0, c_0, r_0, \mu_{\max} > \mu_{\min} > 0$, there are distributions $P \in \mathcal{P}_\Sigma$
such that the regression function $\eta$ associated with $P$ crosses 1/2 in the interior
of the support of $P_X$.

$^1$ A function $f : \mathbb{R}^d \to \mathbb{R}$ is said to hit the level $a \in \mathbb{R}$ at $x_0 \in \mathbb{R}^d$ if and only if $f(x_0) = a$ and
for any $r > 0$ there exists $x \in B(x_0, r)$ such that $f(x) \ne a$.

$^2$ A function $f : \mathbb{R}^d \to \mathbb{R}$ is said to cross the level $a \in \mathbb{R}$ at $x_0 \in \mathbb{R}^d$ if and only if for any $r > 0$,
there exist $x_-$ and $x_+$ in $B(x_0, r)$ such that $f(x_-) < a$ and $f(x_+) > a$.

Note that the condition $\alpha(1 \wedge \beta) > 1$ appearing in the last assertion is equivalent
> , which is necessary to have super-fast rates. As a consequence, 

in this context, super-fast rates cannot occur when the regression function crosses
1/2 in the interior of the support. The third assertion of the proposition shows that
super-fast rates can occur with regression functions hitting 1/2 in the interior of the
support of $P_X$, provided that the regression function is highly smooth and defined
on a high-dimensional space and that a strong margin assumption holds (i.e., $\alpha$ is
large).

Proof. See Section ■ 

The following lower bound shows optimality of the rate of convergence for the
Hölder classes obtained in Theorem 3.3.

Theorem 3.5 Let $d \ge 1$ be an integer, and let $L, \beta, \alpha$ be positive constants, such
that $\alpha\beta \le d$. Then there exists a constant $C > 0$ such that for any $n \ge 1$ and any
classifier $\hat f_n : Z^n \to \mathcal{F}$, we have

$$\sup_{P \in \mathcal{P}_\Sigma} \big\{\mathbf{E} R(\hat f_n) - R(f^*)\big\} \ge C n^{-\frac{\beta(1+\alpha)}{2\beta+d}}.$$

Proof. See Section IHT^ ■

Note that the lower bound of Theorem 3.5 does not cover the case $\alpha\beta > d$, and
hence does not cover the super-fast rates.

Finally, we discuss the case where "$\alpha = \infty$", which means that there exists $t_0 > 0$
such that

$$P_X\big(0 < |\eta(X) - 1/2| < t_0\big) = 0. \tag{3.9}$$

This is a very favorable situation for classification. The rates of convergence of the
ERM type classifiers under (3.9) are, of course, faster than under Assumption (MA)
with $\alpha < \infty$ [cf. Massart and Nédélec (2003)], but they are not faster than $n^{-1}$.
Indeed, Massart and Nédélec (2003) provide a lower bound showing that, even if
Assumption (CAD) is replaced by a very strong assumption that the true decision set
belongs to a VC-class (note that both assumptions are naturally linked to the study
of the ERM type classifiers), the best achievable rate is of the order $(\log n)/n$. We show
now that for the plug-in classifiers much faster rates can be attained. Specifically,
if the regression function $\eta$ has some (arbitrarily low) Hölder smoothness $\beta$, the rate
of convergence can be exponential in $n$. To show this, we first state a simple lemma
which is valid for any plug-in classifier $\hat f_n$.

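Lemma 3.6 Let assumption (3.9) be satisfied, and let $\hat\eta_n$ be an estimator of the
regression function $\eta$. Then for the plug-in classifier $\hat f_n = \mathbf{1}_{\{\hat\eta_n \ge 1/2\}}$ we have

$$\mathbf{E} R(\hat f_n) - R(f^*) \le \mathbf{P}\big(|\hat\eta_n(X) - \eta(X)| \ge t_0\big).$$

Proof. Following an argument similar to the proof of Theorem 3.1 and using
condition (3.9) we get

$$\begin{aligned}
\mathbf{E} R(\hat f_n) - R(f^*) &\le 2 t_0\, P_X\big(0 < |\eta(X) - 1/2| < t_0\big) \\
&\quad + \mathbf{E}\big(|2\eta(X) - 1|\, \mathbf{1}_{\{\hat f_n(X) \ne f^*(X)\}} \mathbf{1}_{\{|\eta(X) - 1/2| \ge t_0\}}\big) \\
&= \mathbf{E}\big(|2\eta(X) - 1|\, \mathbf{1}_{\{\hat f_n(X) \ne f^*(X)\}} \mathbf{1}_{\{|\eta(X) - 1/2| \ge t_0\}}\big) \\
&\le \mathbf{P}\big(|\hat\eta_n(X) - \eta(X)| \ge t_0\big).
\end{aligned}$$

■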

Lemma 3.6 and Theorem 3.2 immediately imply that, under assumption (3.9),
the rate of convergence of the plug-in classifier $\hat f_n^* = \mathbf{1}_{\{\hat\eta_n^* \ge 1/2\}}$ with a small enough
fixed (independent of $n$) bandwidth $h$ is exponential. To state the result, we denote
by $\mathcal{P}_{\Sigma,\infty}$ the class of probability distributions $P$ defined in the same way as $\mathcal{P}_\Sigma$, with
the only difference that in Definition 3.1 the margin assumption (MA) is replaced
by condition (3.9).

Proposition 3.7 There exists a fixed (independent of $n$) $h > 0$ such that for any
$n \ge 1$ the excess risk of the plug-in classifier $\hat f_n^* = \mathbf{1}_{\{\hat\eta_n^* \ge 1/2\}}$ with bandwidth $h$ satisfies

$$\sup_{P \in \mathcal{P}_{\Sigma,\infty}} \big\{\mathbf{E} R(\hat f_n^*) - R(f^*)\big\} \le C_4 \exp(-C_5 n),$$

where the constants $C_4, C_5 > 0$ depend only on $t_0$, $\beta$, $d$, $L$, $c_0$, $r_0$, $\mu_{\min}$, $\mu_{\max}$, and
on the kernel $K$.

Proof. Use Lemma 3.6, choose $h > 0$ such that $h < \min\big(r_0/c, (t_0/C_3)^{1/\beta}\big)$, and
apply (3.7) with $\delta = t_0$. ■

Koltchinskii and Beznosova (2005) prove a result on exponential rates for the
plug-in classifier with some penalized regression estimator in place of the locally
polynomial one that we use here. Their result is stated under a less general condition,
in the sense that they consider only the Lipschitz class of regression functions $\eta$, while
in Proposition 3.7 the Hölder smoothness $\beta$ can be arbitrarily close to 0. Note also
that we do not impose any complexity assumption on the decision set. However,
the class of distributions $\mathcal{P}_{\Sigma,\infty}$ is quite restricted in a different sense. Indeed, for
such distributions condition (3.9) should be compatible with the assumption that
$\eta$ belongs to a Hölder class. A sufficient condition for that is the existence of a
band or a "corridor" of zero $P_X$-measure separating the sets $\{x : \eta(x) > 1/2\}$ and
$\{x : \eta(x) < 1/2\}$. We believe that this condition is close to the necessary one.

4 Optimal learning rates without the strong density assumption

In this section we show that if $P_X$ does not admit a density bounded away from zero
on its support, the rates of classification are slower than those obtained in Section
3. In particular, super-fast rates, i.e., rates faster than $n^{-1}$, cannot be achieved.
Introduce the following class of probability distributions.

Definition 4.1 For a fixed parameter $\alpha \ge 0$, fixed positive parameters
$c_0, r_0, C_0, \beta, L, \mu_{\max} > 0$ and a fixed compact $\mathcal{C} \subset \mathbb{R}^d$, let $\mathcal{P}'_\Sigma$ denote the class of
all probability distributions $P$ on $Z$ such that

(i) the margin assumption (MA) is satisfied,

(ii) the regression function $\eta$ belongs to the Hölder class $\Sigma(\beta, L, \mathbb{R}^d)$,

(iii) the mild density assumption on $P_X$ is satisfied.

In this section we mainly assume that the distribution $P$ of $(X, Y)$ belongs to $\mathcal{P}'_\Sigma$,
but we also consider larger classes of distributions satisfying the margin assumption
(MA) and the complexity assumption (CAR).

Clearly, $\mathcal{P}_\Sigma \subset \mathcal{P}'_\Sigma$. The only difference between $\mathcal{P}'_\Sigma$ and $\mathcal{P}_\Sigma$ is that for $\mathcal{P}'_\Sigma$ the
marginal density of $X$ is not bounded away from zero. The optimal rates for $\mathcal{P}'_\Sigma$ are
slower than for $\mathcal{P}_\Sigma$. Indeed, we have the following lower bound for the excess risk.

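Theorem 4.1 Let $d \ge 1$ be an integer, and let $L, \beta, \alpha$ be positive constants. Then
there exists a constant $C > 0$ such that for any $n \ge 1$ and any classifier $\hat f_n : Z^n \to \mathcal{F}$
we have

$$\sup_{P \in \mathcal{P}'_\Sigma} \big\{\mathbf{E} R(\hat f_n) - R(f^*)\big\} \ge C n^{-\frac{(1+\alpha)\beta}{(2+\alpha)\beta+d}}.$$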
Proof. See Section ■ 

In particular, when $\alpha = d/\beta$, we get the slow convergence rate $1/\sqrt{n}$, instead of
the fast rate $n^{-\frac{\beta+d}{2\beta+d}}$ obtained in Theorem 3.3 under the strong density assumption.
Nevertheless, the lower bound can still approach $n^{-1}$, as the margin parameter $\alpha$
tends to $\infty$.
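The comparison between the two density assumptions can be checked with exact arithmetic. The sketch below assumes that the exponent in Theorem 4.1 is $(1+\alpha)\beta/((2+\alpha)\beta+d)$ and that the exponent in Theorem 3.3 is $\beta(1+\alpha)/(2\beta+d)$; the function names are illustrative.

```python
from fractions import Fraction


def mild_density_exponent(alpha, beta, d):
    """Exponent of the lower bound under the mild density assumption."""
    a, b = Fraction(alpha), Fraction(beta)
    return (1 + a) * b / ((2 + a) * b + d)


def strong_density_exponent(alpha, beta, d):
    """Exponent of the plug-in rate under the strong density assumption."""
    a, b = Fraction(alpha), Fraction(beta)
    return b * (1 + a) / (2 * b + d)
```

With $\beta = 2$, $d = 3$ and $\alpha = d/\beta = 3/2$, the first exponent is exactly $1/2$ and the second is $5/7 = (\beta+d)/(2\beta+d)$. The first exponent is always below 1, since $(1+\alpha)\beta < (2+\alpha)\beta + d$.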

We now show that the rate of convergence given in Theorem 4.1 is optimal in
the sense that there exist estimators that achieve this rate. This will be obtained
as a consequence of a general upper bound for the excess risk of classifiers over a
larger set $\mathcal{P}$ of distributions than $\mathcal{P}'_\Sigma$.

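Fix a Lebesgue measurable set $\mathcal{C} \subseteq \mathbb{R}^d$ and a value $1 \le p \le \infty$. Let $\Sigma$ be a
class of regression functions $\eta$ on $\mathbb{R}^d$ such that Assumption (CAR) is satisfied where
the $\varepsilon$-entropy is taken w.r.t. the $L_p(\mathcal{C}, \lambda)$ norm ($\lambda$ is the Lebesgue measure on $\mathbb{R}^d$).
Then for every $\varepsilon > 0$ there exists an $\varepsilon$-net $\mathcal{N}_\varepsilon$ on $\Sigma$ w.r.t. this norm such that

$$\log\big(\mathrm{card}\, \mathcal{N}_\varepsilon\big) \le A' \varepsilon^{-\rho},$$

where $A'$ is a constant. Consider the empirical risk

$$R_n(f) = \frac{1}{n} \sum_{i=1}^n \mathbf{1}_{\{f(X_i) \ne Y_i\}}, \quad f \in \mathcal{F},$$

and set

$$\varepsilon_n = \varepsilon_n(\alpha, \rho, p) \triangleq
\begin{cases}
n^{-\frac{1}{2+\alpha+\rho}} & \text{if } p = \infty, \\[4pt]
n^{-\frac{p+\alpha}{(2+\alpha)p + \rho(p+\alpha)}} & \text{if } 1 \le p < \infty.
\end{cases}$$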
Define a sieve estimator 17^ of tlie regression function r] by tfie relation 

r/f e Argmin^g^^^^i?,(/^) (4.1) 

where /fj(x) = I{f;(a;)>i/2}, and consider the classifier = lI|^s>i/2}- Note that 
can be viewed as a "hybrid" plug-in/ ERM procedure: the ERM is performed on a 
set of plug-in rules corresponding to a grid on the class of regression functions rj. 
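The hybrid procedure can be sketched in a few lines of code. This is only an illustration under simplifying assumptions: the net is passed as an explicit finite list of candidate regression functions (constructing an actual $\varepsilon$-net is not addressed), and the names `plug_in`, `empirical_risk` and `sieve_estimator` are ours.

```python
def plug_in(eta_bar):
    """Plug-in rule x -> 1{eta_bar(x) >= 1/2} built from a candidate regression function."""
    return lambda x: 1 if eta_bar(x) >= 0.5 else 0


def empirical_risk(classifier, sample):
    """R_n(f): fraction of sample points (x, y) with f(x) != y."""
    return sum(1 for x, y in sample if classifier(x) != y) / len(sample)


def sieve_estimator(net, sample):
    """Return an element of the finite net whose plug-in rule minimizes the empirical risk."""
    return min(net, key=lambda eta_bar: empirical_risk(plug_in(eta_bar), sample))


if __name__ == "__main__":
    # Toy net of three candidates on [0, 1] (hypothetical, for illustration only).
    net = [lambda x: 0.2, lambda x: x, lambda x: 1.0 - x]
    sample = [(0.1, 0), (0.3, 0), (0.7, 1), (0.9, 1)]
    eta_hat = sieve_estimator(net, sample)
    print(empirical_risk(plug_in(eta_hat), sample))
```

The minimization runs over the net only, so its cost is proportional to the cardinality of the net times the sample size.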

Theorem 4.2 Let $\mathcal{P}$ be a set of probability distributions $P$ on $\mathcal{Z}$ such that

(i) the margin assumption (MA) is satisfied,

(ii) the regression function $\eta$ belongs to a class $\Sigma$ which satisfies the complexity
assumption (CAR) with the $\varepsilon$-entropy taken w.r.t. the $L_p(\mathcal{C}, \lambda)$ norm for some
$1 \le p \le \infty$,

(iii) for all $P \in \mathcal{P}$ the supports of marginal distributions $P_X$ are included in $\mathcal{C}$.

Consider the classifier $\hat f^S_n = \mathbb{1}_{\{\hat\eta^S_n \ge 1/2\}}$. If $p = \infty$, for any $n \ge 1$ we have

$$\sup_{P \in \mathcal{P}} \big\{ E R(\hat f^S_n) - R(f^*) \big\} \le C n^{-\frac{1+\alpha}{2+\alpha+\rho}}. \tag{4.2}$$

If $1 \le p < \infty$ and, in addition, for all $P \in \mathcal{P}$ the marginal distributions $P_X$ are
absolutely continuous w.r.t. the Lebesgue measure and their densities are uniformly
bounded from above by some constant $\mu_{\max} < \infty$, then for any $n \ge 1$ we have

$$\sup_{P \in \mathcal{P}} \big\{ E R(\hat f^S_n) - R(f^*) \big\} \le C n^{-\frac{(1+\alpha)p}{(2+\alpha)p+\rho(p+\alpha)}}. \tag{4.3}$$

Proof. See Section 6.4. ■

Theorem 4.2 allows one to get fast classification rates without any density
assumption on $P_X$. Namely, define the following class of distributions $P$ of $(X, Y)$.

Definition 4.2 For fixed parameters $\alpha > 0$, $C_0 > 0$, $\beta > 0$, $L > 0$, and for a fixed
compact $\mathcal{C} \subset \mathbb{R}^d$, let $\mathcal{P}^0_\Sigma$ denote the class of all probability distributions $P$ on $\mathcal{Z}$ such
that

(i) the margin assumption (MA) is satisfied,

(ii) the regression function $\eta$ belongs to the Hölder class $\Sigma(\beta, L, \mathbb{R}^d)$,

(iii) for all $P \in \mathcal{P}$ the supports of marginal distributions $P_X$ are included in $\mathcal{C}$.

If $\mathcal{C}$ is a compact, the estimates of $\varepsilon$-entropies of Hölder classes $\Sigma(\beta, L, \mathbb{R}^d)$ in
the $L_\infty(\mathcal{C}, \lambda)$ norm can be obtained from Kolmogorov and Tikhomirov (1961), and
they yield Assumption (CAR) with $\rho = d/\beta$. Therefore, from (4.2) we easily get the
following upper bound.

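Theorem 4.3 Let $d \ge 1$ be an integer, and let $L, \beta, \alpha$ be positive constants. For
any $n \ge 1$ the classifier $\hat f^S_n = \mathbb{1}_{\{\hat\eta^S_n \ge 1/2\}}$ defined by (4.1) with $p = \infty$ satisfies

$$\sup_{P \in \mathcal{P}^0_\Sigma} \big\{ E R(\hat f^S_n) - R(f^*) \big\} \le C n^{-\frac{(1+\alpha)\beta}{(2+\alpha)\beta+d}}$$

with some constant $C > 0$ depending only on $\alpha$, $\beta$, $d$, $L$ and $C_0$.

Since $\mathcal{P}'_\Sigma \subset \mathcal{P}^0_\Sigma$, Theorems 4.1 and 4.3 show that $n^{-\frac{(1+\alpha)\beta}{(2+\alpha)\beta+d}}$ is the optimal rate of
convergence of the excess risk on the class of distributions $\mathcal{P}^0_\Sigma$.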

5 Comparison lemmas 

In this section we give some useful inequalities between the risks of plug-in classifiers 
and the Lp risks of the corresponding regression estimators under the margin as- 
sumption (MA). These inequalities will be helpful in the proofs. They also illustrate 
a connection between the two complexity assumptions (CAR) and (CAD) defined in 
the Introduction and allow one to compare our study of plug-in estimators with that 
given by Yang (1999) who considered the case $\alpha = 0$ (no margin assumption), as well
as with the developments in Bartlett, Jordan and McAuliffe (2003) and Blanchard, 
Lugosi and Vayatis (2003). 

Throughout this section $\hat\eta$ is a Borel function on $\mathbb{R}^d$ and

$$f(x) = \mathbb{1}_{\{\hat\eta(x) \ge 1/2\}}.$$

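For $1 \le p \le \infty$ we denote by $\|\cdot\|_p$ the $L_p(\mathbb{R}^d, P_X)$ norm. We first state some
comparison inequalities for the $L_\infty$ norm.

Lemma 5.1 For any distribution $P$ of $(X, Y)$ satisfying Assumption (MA) we have

$$R(f) - R(f^*) \le 2 C_0 \|\hat\eta - \eta\|_\infty^{1+\alpha} \tag{5.1}$$

and

$$P_X\big(f(X) \ne f^*(X),\ \eta(X) \ne 1/2\big) \le C_0 \|\hat\eta - \eta\|_\infty^{\alpha}. \tag{5.2}$$

Proof. To show (5.1) note that

$$\begin{aligned}
R(f) - R(f^*) &= E\big(|2\eta(X) - 1|\, \mathbb{1}_{\{f(X) \ne f^*(X)\}}\big) \\
&\le 2 E\big(|\eta(X) - \tfrac12|\, \mathbb{1}_{\{0 < |\eta(X) - \frac12| \le |\eta(X) - \hat\eta(X)|\}}\big) \\
&\le 2 \|\eta - \hat\eta\|_\infty\, P_X\big(0 < |\eta(X) - \tfrac12| \le \|\eta - \hat\eta\|_\infty\big) \\
&\le 2 C_0 \|\eta - \hat\eta\|_\infty^{1+\alpha}.
\end{aligned}$$

Similarly,

$$\begin{aligned}
P_X\big(f(X) \ne f^*(X),\ \eta(X) \ne 1/2\big) &\le P_X\big(0 < |\eta(X) - \tfrac12| \le |\hat\eta(X) - \eta(X)|\big) \\
&\le P_X\big(0 < |\eta(X) - \tfrac12| \le \|\hat\eta - \eta\|_\infty\big) \\
&\le C_0 \|\hat\eta - \eta\|_\infty^{\alpha}.
\end{aligned}$$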
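Inequality (5.1) can be illustrated on a toy distribution. The sketch below is our own example, not taken from the results above: $X$ is uniform on $[0,1]$ and $\eta(x) = x$, so that $P_X(0 < |\eta(X) - 1/2| \le t) \le 2t$ and (MA) holds with $\alpha = 1$, $C_0 = 2$; the estimator is the shifted function $\hat\eta(x) = x + s$, for which $\|\hat\eta - \eta\|_\infty = s$ (the values of $\hat\eta$ may leave $[0,1]$, which is irrelevant for the plug-in rule).

```python
def excess_risk(shift, steps=100000):
    """Excess risk of f = 1{eta_hat >= 1/2} for X ~ U[0, 1], eta(x) = x and
    eta_hat(x) = x + shift, using R(f) - R(f*) = E[|2 eta(X) - 1| 1{f(X) != f*(X)}].
    The expectation is approximated by a midpoint Riemann sum."""
    h = 1.0 / steps
    total = 0.0
    for i in range(steps):
        x = (i + 0.5) * h
        f = 1 if x + shift >= 0.5 else 0
        f_star = 1 if x >= 0.5 else 0
        if f != f_star:
            total += abs(2.0 * x - 1.0) * h
    return total


def lemma_5_1_bound(sup_norm, c0=2.0, alpha=1.0):
    """Right-hand side 2 * C_0 * ||eta_hat - eta||_inf ** (1 + alpha) of (5.1)."""
    return 2.0 * c0 * sup_norm ** (1.0 + alpha)


if __name__ == "__main__":
    for s in (0.05, 0.1, 0.2):
        print(s, excess_risk(s), lemma_5_1_bound(s))
```

In this example the excess risk equals $s^2$ exactly (the integral of $1 - 2x$ over $[1/2 - s, 1/2]$), to be compared with the bound $4 s^2$.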



Remark 5.1 Lemma 5.1 offers an easy way to obtain the result of Theorem 3.3 in
a slightly less precise form, with an extra logarithmic factor in the rate. In fact,
under the strong density assumption, there exist nonparametric estimators $\hat\eta_n$ (for
instance, suitably chosen locally polynomial estimators) such that

$$E\big(\|\hat\eta_n - \eta\|_\infty^q\big) \le C \Big(\frac{\log n}{n}\Big)^{\frac{q\beta}{2\beta+d}}, \qquad \forall\, q > 0,$$

uniformly over $\eta \in \Sigma(\beta, L, \mathbb{R}^d)$ [see, e.g., Stone (1982)]. Taking here $q = 1 + \alpha$ and
applying the comparison inequality (5.1) we immediately get that the plug-in classifier
$\hat f_n = \mathbb{1}_{\{\hat\eta_n \ge 1/2\}}$ has the excess risk $\mathcal{E}(\hat f_n)$ of the order $(n/\log n)^{-\beta(1+\alpha)/(2\beta+d)}$.

Another immediate application of Lemma 5.1 is to get lower bounds on the risks
of regression estimators in the $L_\infty$ norm from the corresponding lower bounds on
the excess risks of classifiers (cf. Theorems 3.5 and 4.1). But here again we lose a
logarithmic factor required for the best bounds.

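We now consider the comparison inequalities for $L_p$ norms with $1 \le p < \infty$.

Lemma 5.2 For any $1 \le p < \infty$ and any distribution $P$ of $(X, Y)$ satisfying
Assumption (MA) with $\alpha > 0$ we have

$$R(f) - R(f^*) \le C_1(\alpha, p)\, \|\hat\eta - \eta\|_p^{\frac{(1+\alpha)p}{p+\alpha}} \tag{5.3}$$

and

$$P_X\big(f(X) \ne f^*(X),\ \eta(X) \ne 1/2\big) \le C_2(\alpha, p)\, \|\hat\eta - \eta\|_p^{\frac{\alpha p}{\alpha+p}}, \tag{5.4}$$

where $C_1(\alpha, p) = 2(\alpha + p) p^{-1} \big(\tfrac{p}{\alpha}\big)^{\frac{\alpha}{\alpha+p}} C_0^{\frac{p-1}{\alpha+p}}$ and $C_2(\alpha, p) = (\alpha + p) p^{-1} \big(\tfrac{p}{\alpha}\big)^{\frac{\alpha}{\alpha+p}} C_0^{\frac{p}{\alpha+p}}$. In
particular,

$$R(f) - R(f^*) \le C_1(\alpha, 2) \Big( \int [\hat\eta(x) - \eta(x)]^2 P_X(dx) \Big)^{\frac{1+\alpha}{2+\alpha}}. \tag{5.5}$$

Proof. For any $t > 0$ we have

$$\begin{aligned}
R(f) - R(f^*) &= 2 E\big[|\eta(X) - 1/2|\, \mathbb{1}_{\{f(X) \ne f^*(X)\}} \mathbb{1}_{\{0 < |\eta(X) - 1/2| \le t\}}\big] \\
&\quad + 2 E\big[|\eta(X) - 1/2|\, \mathbb{1}_{\{f(X) \ne f^*(X)\}} \mathbb{1}_{\{|\eta(X) - 1/2| > t\}}\big] \\
&\le 2 E\big[|\eta(X) - \hat\eta(X)|\, \mathbb{1}_{\{0 < |\eta(X) - 1/2| \le t\}}\big] + 2 E\big[|\eta(X) - \hat\eta(X)|\, \mathbb{1}_{\{|\eta(X) - \hat\eta(X)| > t\}}\big] \\
&\le 2 \|\eta - \hat\eta\|_p \big[P_X(0 < |\eta(X) - 1/2| \le t)\big]^{\frac{p-1}{p}} + \frac{2 \|\eta - \hat\eta\|_p^p}{t^{p-1}}
\end{aligned} \tag{5.6}$$

by Hölder and Markov inequalities. So, for any $t > 0$, introducing $E = \|\eta - \hat\eta\|_p$ and
using Assumption (MA) to bound the probability in (5.6) we obtain

$$R(f) - R(f^*) \le 2 \Big( C_0^{\frac{p-1}{p}}\, t^{\frac{\alpha(p-1)}{p}} E + \frac{E^p}{t^{p-1}} \Big).$$

Minimizing in $t$ the RHS of this inequality we get (5.3). Similarly,

$$\begin{aligned}
P_X\big(f(X) \ne f^*(X),\ \eta(X) \ne 1/2\big) &\le P_X\big(0 < |\eta(X) - 1/2| \le t\big) + P_X\big(|\hat\eta(X) - \eta(X)| > t\big) \\
&\le C_0 t^{\alpha} + \frac{\|\hat\eta - \eta\|_p^p}{t^p},
\end{aligned}$$

and minimizing this bound in $t$ we obtain (5.4). ■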
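The minimization in $t$ leading to (5.4) can be verified numerically. The sketch below, with function names of our own choosing, evaluates $C_0 t^\alpha + E^p/t^p$ at its stationary point and compares the value with $C_2(\alpha, p)\, E^{\alpha p/(\alpha+p)}$.

```python
def c2(alpha, p, c0):
    """Constant C_2(alpha, p) = (alpha + p) / p * (p / alpha)**(alpha / (alpha + p)) * C_0**(p / (alpha + p))."""
    return (alpha + p) / p * (p / alpha) ** (alpha / (alpha + p)) * c0 ** (p / (alpha + p))


def bound(t, alpha, p, c0, e):
    """Bound C_0 * t**alpha + E**p / t**p before minimization in t."""
    return c0 * t ** alpha + (e / t) ** p


def minimizer(alpha, p, c0, e):
    """Stationary point of t -> bound(t, ...): solves alpha * C_0 * t**(alpha + p) = p * E**p."""
    return (p * e ** p / (alpha * c0)) ** (1.0 / (alpha + p))


if __name__ == "__main__":
    alpha, p, c0, e = 1.5, 2.0, 3.0, 0.1
    t_star = minimizer(alpha, p, c0, e)
    print(bound(t_star, alpha, p, c0, e))
    print(c2(alpha, p, c0) * e ** (alpha * p / (alpha + p)))
```

The derivative of the bound changes sign exactly once on $(0, \infty)$, so the stationary point is the global minimizer; the same computation with the two terms of the bound preceding (5.3) yields $C_1(\alpha, p)$.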

If the regression function $\eta$ belongs to the Hölder class $\Sigma(\beta, L, \mathbb{R}^d)$ there exist
estimators $\hat\eta_n$ such that, uniformly over the class,

$$E\big\{ [\hat\eta_n(X) - \eta(X)]^2 \big\} \le C n^{-\frac{2\beta}{2\beta+d}} \tag{5.7}$$

for some constant $C > 0$. This has been shown by Stone (1982) under the additional
strong density assumption and by Yang (1999) with no assumption on $P_X$. Using
(5.7) and (5.5) we get that the excess risk of the corresponding plug-in classifier
$\hat f_n = \mathbb{1}_{\{\hat\eta_n \ge 1/2\}}$ admits a bound of the order $n^{-\frac{2\beta}{2\beta+d} \cdot \frac{1+\alpha}{2+\alpha}}$, which is suboptimal when
$\alpha \ne 0$ (cf. Theorems 4.2 and 4.3). In other words, under the margin assumption, Lemma
5.2 is not the right tool to analyze the convergence rate of plug-in classifiers. On
the contrary, when no margin assumption is imposed (i.e., $\alpha = 0$ in our notation)
inequality (1.2), which is a version of (5.5) for $\alpha = 0$, is precise enough to give the
optimal rate of classification [Yang (1999)].

Another way to obtain (5.5) is to use Bartlett, Jordan and McAuliffe (2003): it is
enough to apply their Theorem 10 with (in their notation) $\phi(t) = (1 - t)^2$, $\psi(t) = t^2$
and to note that for this choice of $\phi$ we have $R_\phi(\hat\eta) - R^*_\phi = \|\eta - \hat\eta\|_2^2$. Blanchard,
Lugosi and Vayatis (2003) used the result of Bartlett, Jordan and McAuliffe (2003)
to prove fast rates of the order $n^{-\frac{2(1+\alpha)}{3(2+\alpha)}}$ for a boosting procedure over the class of
regression functions $\eta$ of bounded variation in dimension $d = 1$. Note that the same
rates can be obtained for other plug-in classifiers using (5.5). Indeed, if $\eta$ is of
bounded variation, there exist estimators of $\eta$ converging with the mean squared $L_2$
rate $n^{-2/3}$ [cf. van de Geer (2000), Györfi et al. (2002)], and thus application of (5.5)
immediately yields the rate $n^{-\frac{2(1+\alpha)}{3(2+\alpha)}}$ for the corresponding plug-in rule. However,
Theorem 4.2 shows that this is not an optimal rate (here again we observe that
inequality (5.5) fails to establish the optimal properties of plug-in classifiers). In
fact, let $d = 1$ and let the assumptions of Theorem 4.2 be satisfied, where instead of
assumption (ii) we use its particular instance: $\eta$ belongs to a class of functions on
$[0, 1]$ whose total variation is bounded by a constant $L < \infty$. It follows from Birman
and Solomjak (1967) that Assumption (CAR) for this class is satisfied with $\rho = 1$
for any $1 \le p < \infty$. Hence, we can apply (4.3) of Theorem 4.2 to find that

$$\sup_{P \in \mathcal{P}} \big\{ E R(\hat f^S_n) - R(f^*) \big\} \le C n^{-\frac{(1+\alpha)p}{(2+\alpha)p+(p+\alpha)}} \tag{5.8}$$

for the corresponding class $\mathcal{P}$. If $p > 2$ (recall that the value $p \in [1, \infty)$ is chosen by
the statistician), the rate in (5.8) is faster than $n^{-\frac{2(1+\alpha)}{3(2+\alpha)}}$ obtained under the same
conditions by Blanchard, Lugosi and Vayatis (2003).
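The role of the threshold $p = 2$ can be checked with exact arithmetic. In the sketch below (function names are ours) the two exponents are compared for $\rho = 1$; cross-multiplying shows that the exponent in (5.8) exceeds $2(1+\alpha)/(3(2+\alpha))$ exactly when $\alpha p > 2\alpha$.

```python
from fractions import Fraction


def exponent_via_5_5(alpha):
    """Exponent 2(1 + alpha) / (3(2 + alpha)) obtained through inequality (5.5)."""
    alpha = Fraction(alpha)
    return 2 * (1 + alpha) / (3 * (2 + alpha))


def exponent_sieve(alpha, p, rho=1):
    """Exponent (1 + alpha) p / ((2 + alpha) p + rho (p + alpha)) of (4.3);
    rho = 1 corresponds to the bounded variation class."""
    alpha, p, rho = Fraction(alpha), Fraction(p), Fraction(rho)
    return (1 + alpha) * p / ((2 + alpha) * p + rho * (p + alpha))


if __name__ == "__main__":
    alpha = 1
    for p in (1, 2, 3, 10):
        print(p, exponent_sieve(alpha, p), exponent_via_5_5(alpha))
```

For $p = 2$ the two exponents coincide, and the exponent in (5.8) increases with $p$.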
6 Proofs

6.1 Proof of Theorem 3.2

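Consider a distribution $P$ in $\mathcal{P}_\Sigma$. Let $A$ be the support of $P_X$. Fix $x \in A$
and $\delta > 0$. Consider the matrix $B = (B_{s_1,s_2})_{|s_1|,|s_2| \le \lfloor\beta\rfloor}$ with elements $B_{s_1,s_2} =
\int_{\mathbb{R}^d} u^{s_1+s_2} K(u) \mu(x + hu)\,du$. The smallest eigenvalue $\lambda_{\bar B}$ of $\bar B$ satisfies

$$\begin{aligned}
\lambda_{\bar B} &= \min_{\|W\|=1} W^T \bar B W \\
&\ge \min_{\|W\|=1} W^T B W + \min_{\|W\|=1} W^T (\bar B - B) W \\
&\ge \min_{\|W\|=1} W^T B W - \sum_{|s_1|,|s_2| \le \lfloor\beta\rfloor} |\bar B_{s_1,s_2} - B_{s_1,s_2}|.
\end{aligned} \tag{6.1}$$

Let $A_n = \{u \in \mathbb{R}^d : \|u\| \le c;\ x + hu \in A\}$ where $c$ is the constant appearing in
(3.3). Using (3.3), for any vector $W$ satisfying $\|W\| = 1$, we obtain

$$W^T B W = \int_{\mathbb{R}^d} \Big( \sum_{|s| \le \lfloor\beta\rfloor} W_s u^s \Big)^2 K(u) \mu(x + hu)\,du \ge c\, \mu_{\min} \int_{A_n} \Big( \sum_{|s| \le \lfloor\beta\rfloor} W_s u^s \Big)^2 du.$$

By assumption of the theorem, $ch \le r_0$. Since the support of the marginal
distribution is $(c_0, r_0)$-regular we get

$$\lambda[A_n] \ge h^{-d} \lambda[B(x, ch) \cap A] \ge c_0 h^{-d} \lambda[B(x, ch)] \ge c_0 v_d c^d,$$

where $v_d = \lambda[B(0, 1)]$ is the volume of the unit ball and $c_0 > 0$ is the constant
introduced in the definition (2.1) of the $(c_0, r_0)$-regular set.

Let $\mathcal{A}$ denote the class of all compact subsets of $B(0, c)$ having the Lebesgue
measure $c_0 v_d c^d$. Using the previous displays we obtain

$$\min_{\|W\|=1} W^T B W \ge c\, \mu_{\min} \min_{\|W\|=1;\ S \in \mathcal{A}} \int_S \Big( \sum_{|s| \le \lfloor\beta\rfloor} W_s u^s \Big)^2 du \triangleq 2\mu_0. \tag{6.2}$$

By the compactness argument, the minimum in (6.2) exists and is strictly positive.
For $i = 1, \dots, n$ and any multi-indices $s_1, s_2$ such that $|s_1|, |s_2| \le \lfloor\beta\rfloor$, define

$$T_i = \frac{1}{h^d} \Big( \frac{X_i - x}{h} \Big)^{s_1+s_2} K\Big( \frac{X_i - x}{h} \Big) - \int_{\mathbb{R}^d} u^{s_1+s_2} K(u) \mu(x + hu)\,du.$$

We have $E T_i = 0$, $|T_i| \le h^{-d} \sup_{u \in \mathbb{R}^d} (1 + \|u\|^{2\beta}) K(u) \triangleq \kappa_1 h^{-d}$ and the following
bound for the variance of $T_i$:

$$\operatorname{Var} T_i \le \frac{1}{h^{2d}} E \Big( \frac{X_i - x}{h} \Big)^{2s_1+2s_2} K^2\Big( \frac{X_i - x}{h} \Big) = \frac{1}{h^d} \int_{\mathbb{R}^d} u^{2s_1+2s_2} K^2(u) \mu(x + hu)\,du.$$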
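From Bernstein's inequality, we get

$$P^{\otimes n}\big( |\bar B_{s_1,s_2} - B_{s_1,s_2}| > \epsilon \big) = P^{\otimes n}\Big( \Big| \frac1n \sum_{i=1}^n T_i \Big| > \epsilon \Big) \le 2 \exp\Big\{ -\frac{n h^d \epsilon^2}{2\kappa_2 + 2\kappa_1 \epsilon/3} \Big\}.$$

This and (6.1)–(6.2) imply that

$$P^{\otimes n}(\lambda_{\bar B} \le \mu_0) \le 2 M^2 \exp(-C n h^d) \tag{6.3}$$

where $M^2$ is the number of elements of the matrix $\bar B$. Assume in what follows that
$n$ is large enough, so that $\mu_0 > (\log n)^{-1}$. Then for $\lambda_{\bar B} > \mu_0$ we have $|\hat\eta^*_n(x) - \eta(x)| \le
|\hat\eta^{LP}_n(x) - \eta(x)|$. Therefore,

$$P^{\otimes n}\big( |\hat\eta^*_n(x) - \eta(x)| \ge \delta \big) \le P^{\otimes n}(\lambda_{\bar B} \le \mu_0) + P^{\otimes n}\big( |\hat\eta^{LP}_n(x) - \eta(x)| \ge \delta,\ \lambda_{\bar B} > \mu_0 \big). \tag{6.4}$$

We now evaluate the second probability on the right hand side of (6.4). For $\lambda_{\bar B} > \mu_0$
we have $\hat\eta^{LP}_n(x) = U^T(0) Q^{-1} V$ (where $V$ is given by (2.3)). Introduce the matrix
$Z \triangleq (Z_{i,s})_{1 \le i \le n,\ |s| \le \lfloor\beta\rfloor}$ with elements

$$Z_{i,s} \triangleq (X_i - x)^s \sqrt{K\Big( \frac{X_i - x}{h} \Big)}.$$

The $s$-th column of $Z$ is denoted by $Z_s$, and we introduce $Z^{(\eta)} \triangleq \sum_{|s| \le \lfloor\beta\rfloor} \frac{D^s \eta(x)}{s!} Z_s$.
Since $Q = Z^T Z$, we get

$$\forall\, |s| \le \lfloor\beta\rfloor : \quad U^T(0) Q^{-1} Z^T Z_s = \mathbb{1}_{\{s = (0, \dots, 0)\}},$$

hence $U^T(0) Q^{-1} Z^T Z^{(\eta)} = \eta(x)$. So we can write

$$\hat\eta^{LP}_n(x) - \eta(x) = U^T(0) Q^{-1} (V - Z^T Z^{(\eta)}) = U^T(0) \bar B^{-1} a$$

where $a \triangleq \frac{1}{n h^d} H (V - Z^T Z^{(\eta)}) \in \mathbb{R}^M$ and $H$ is a diagonal matrix $H \triangleq
(H_{s_1,s_2})_{|s_1|,|s_2| \le \lfloor\beta\rfloor}$ with $H_{s_1,s_2} \triangleq h^{-|s_1|} \mathbb{1}_{\{s_1 = s_2\}}$. For $\lambda_{\bar B} > \mu_0$ we get

$$|\hat\eta^{LP}_n(x) - \eta(x)| \le \|\bar B^{-1} a\| \le \lambda_{\bar B}^{-1} \|a\| \le \mu_0^{-1} \|a\| \le \mu_0^{-1} M \max_s |a_s|, \tag{6.5}$$

where $a_s$ are the components of the vector $a$ given by

$$a_s = \frac{1}{n h^d} \sum_{i=1}^n [Y_i - \eta_x(X_i)] \Big( \frac{X_i - x}{h} \Big)^s K\Big( \frac{X_i - x}{h} \Big).$$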
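Define

$$T_i^{(s,1)} \triangleq \frac{1}{h^d} [Y_i - \eta(X_i)] \Big( \frac{X_i - x}{h} \Big)^s K\Big( \frac{X_i - x}{h} \Big), \qquad
T_i^{(s,2)} \triangleq \frac{1}{h^d} [\eta(X_i) - \eta_x(X_i)] \Big( \frac{X_i - x}{h} \Big)^s K\Big( \frac{X_i - x}{h} \Big).$$

We have

$$|a_s| \le \Big| \frac1n \sum_{i=1}^n T_i^{(s,1)} \Big| + \Big| \frac1n \sum_{i=1}^n \big[ T_i^{(s,2)} - E T_i^{(s,2)} \big] \Big| + \big| E T_i^{(s,2)} \big|. \tag{6.6}$$

Note that $E T_i^{(s,1)} = 0$, $|T_i^{(s,1)}| \le \kappa_1 h^{-d}$, and

$$\operatorname{Var} T_i^{(s,1)} \le \frac{1}{4 h^d} \int u^{2s} K^2(u) \mu(x + hu)\,du \le (\kappa_2/4)\, h^{-d},$$

$$\operatorname{Var} T_i^{(s,2)} \le h^{-d} L^2 \int h^{2\beta} \|u\|^{2|s|+2\beta} K^2(u) \mu(x + hu)\,du \le L^2 \kappa_2 h^{2\beta-d}.$$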

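From Bernstein's inequality, for any $\epsilon_1, \epsilon_2 > 0$, we obtain

$$P^{\otimes n}\Big( \Big| \frac1n \sum_{i=1}^n T_i^{(s,1)} \Big| \ge \epsilon_1 \Big) \le 2 \exp\Big\{ -\frac{n h^d \epsilon_1^2}{\kappa_2/2 + 2\kappa_1 \epsilon_1/3} \Big\}$$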

and 



Since also 

$$\big| E T_i^{(s,2)} \big| \le L h^\beta \int \|u\|^{|s|+\beta} K^2(u) \mu(x + hu)\,du \le L \kappa_2 h^\beta$$

we get, using (6.6), that if $3\mu_0^{-1} M L \kappa_2 h^\beta \le \delta \le 1$ the following inequality holds



Combining this inequality with (6.3), (6.4) and (6.5), we obtain

$$P^{\otimes n}\big( |\hat\eta^*_n(x) - \eta(x)| \ge \delta \big) \le C_1 \exp(-C_2 n h^d \delta^2) \tag{6.7}$$

for $3\mu_0^{-1} M L \kappa_2 h^\beta \le \delta$ (for $\delta > 1$ inequality (6.7) is obvious since $\hat\eta^*_n$, $\eta$ take values in
$[0, 1]$). The constants $C_1, C_2$ in (6.7) do not depend on the distribution $P_X$, on its
support $A$ and on the point $x \in A$, so that we get (3.7). Now, (3.7) implies (3.8) for
$C n^{-\frac{\beta}{2\beta+d}} \le \delta$, and thus for all $\delta > 0$ (with possibly modified constants $C_1$ and $C_2$).



6.2 Proof of Theorems 3.5 and 4.1

The proof of both theorems is based on Assouad's lemma [see, e.g., Korostelev and 
Tsybakov (1993), Chapter 2 or Tsybakov (2004b), Chapter 2]. We apply it in a 
form adapted for the classification problem (Lemma 5.1 in Audibert (2004)). 
For an integer $q \ge 1$ we consider the regular grid on $\mathbb{R}^d$ defined as

$$G_q \triangleq \Big\{ \Big( \frac{2k_1 + 1}{2q}, \dots, \frac{2k_d + 1}{2q} \Big) : k_i \in \{0, \dots, q - 1\},\ i = 1, \dots, d \Big\}.$$

Let $n_q(x) \in G_q$ be the closest point to $x \in \mathbb{R}^d$ among points in $G_q$ (we assume
uniqueness of $n_q(x)$: if there exist several points in $G_q$ closest to $x$ we define $n_q(x)$ as
the one which is closest to 0). Consider the partition $\mathcal{X}'_1, \dots, \mathcal{X}'_{q^d}$ of $[0, 1]^d$ canonically
defined using the grid $G_q$ ($x$ and $y$ belong to the same subset if and only if $n_q(x) =
n_q(y)$). Fix an integer $m \le q^d$. For any $i \in \{1, \dots, m\}$, we define $\mathcal{X}_i \triangleq \mathcal{X}'_i$ and
$\mathcal{X}_0 \triangleq \mathbb{R}^d \setminus \bigcup_{i=1}^m \mathcal{X}_i$, so that $\mathcal{X}_0, \dots, \mathcal{X}_m$ form a partition of $\mathbb{R}^d$.

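Let $u : \mathbb{R}_+ \to \mathbb{R}_+$ be a nonincreasing infinitely differentiable function such that
$u = 1$ on $[0, 1/4]$ and $u = 0$ on $[1/2, \infty)$. One can take, for example,
$u(x) = \big( \int_{1/4}^{1/2} u_1(t)\,dt \big)^{-1} \int_x^{\infty} u_1(t)\,dt$ where the infinitely differentiable function $u_1$ is defined
as

$$u_1(x) = \begin{cases} \exp\Big\{ -\dfrac{1}{(1/2 - x)(x - 1/4)} \Big\} & \text{for } x \in (1/4, 1/2), \\ 0 & \text{otherwise.} \end{cases}$$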
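The function $u$ can be evaluated numerically. The following sketch uses a plain midpoint rule, which is our choice for illustration only; `integrate` and `TOTAL` are hypothetical helper names.

```python
import math


def u1(x):
    """Bump function: exp(-1 / ((1/2 - x)(x - 1/4))) on (1/4, 1/2), zero elsewhere."""
    if 0.25 < x < 0.5:
        return math.exp(-1.0 / ((0.5 - x) * (x - 0.25)))
    return 0.0


def integrate(f, a, b, steps=2000):
    """Midpoint rule on [a, b]; a crude quadrature used only for illustration."""
    if b <= a:
        return 0.0
    h = (b - a) / steps
    return h * sum(f(a + (i + 0.5) * h) for i in range(steps))


# Normalizing constant: integral of u1 over its support [1/4, 1/2].
TOTAL = integrate(u1, 0.25, 0.5)


def u(x):
    """u(x) = (integral of u1 over [x, infinity)) / (integral of u1 over [1/4, 1/2])."""
    lower = min(max(x, 0.25), 0.5)  # u1 vanishes outside (1/4, 1/2)
    return integrate(u1, lower, 0.5) / TOTAL


if __name__ == "__main__":
    for x in (0.0, 0.25, 0.3, 0.375, 0.45, 0.5, 1.0):
        print(x, u(x))
```

Since $u_1$ is symmetric about $3/8$, the value $u(3/8)$ is close to $1/2$; the transition from 1 to 0 is concentrated in a narrow neighbourhood of this point.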



Let $\phi : \mathbb{R}^d \to \mathbb{R}_+$ be the function defined as

$$\phi(x) \triangleq C_\phi\, u(\|x\|),$$

where the positive constant $C_\phi$ is taken small enough to ensure that $|\phi(x') - \phi_x(x')| \le
L \|x' - x\|^\beta$ for any $x, x' \in \mathbb{R}^d$. Thus, $\phi \in \Sigma(\beta, L, \mathbb{R}^d)$.

Define the hypercube $\mathcal{H} = \{P_\sigma : \sigma = (\sigma_1, \dots, \sigma_m) \in \{-1, 1\}^m\}$ of probability
distributions of $(X, Y)$ on $\mathcal{Z} = \mathbb{R}^d \times \{0, 1\}$ as follows.

For any $P_\sigma \in \mathcal{H}$ the marginal distribution of $X$ does not depend on $\sigma$, and has a
density $\mu$ w.r.t. the Lebesgue measure on $\mathbb{R}^d$ defined in the following way. Fix $0 <
w \le m^{-1}$ and a set $A_0$ of positive Lebesgue measure included in $\mathcal{X}_0$ (the particular
choices of $A_0$ will be indicated later), and take: (i) $\mu(x) = w/\lambda[B(0, (4q)^{-1})]$ if $x$
belongs to a ball $B(z, (4q)^{-1})$ for some $z \in G_q$, (ii) $\mu(x) = (1 - mw)/\lambda[A_0]$ for
$x \in A_0$, (iii) $\mu(x) = 0$ for all other $x$.

Next, the distribution of $Y$ given $X$ for $P_\sigma \in \mathcal{H}$ is determined by the regression
function $\eta_\sigma(x) = P(Y = 1 | X = x)$ that we define as $\eta_\sigma(x) = \frac{1 + \sigma_j \varphi(x)}{2}$ for any $x \in \mathcal{X}_j$,
$j = 1, \dots, m$, and $\eta_\sigma \equiv 1/2$ on $\mathcal{X}_0$, where $\varphi(x) \triangleq q^{-\beta} \phi\big( q[x - n_q(x)] \big)$. We will assume
that $C_\phi \le 1$ to ensure that $\varphi$ and $\eta_\sigma$ take values in $[0, 1]$.
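For any $s \in \mathbb{N}^d$ such that $|s| \le \lfloor\beta\rfloor$, the partial derivative $D^s \varphi$ exists, and
$D^s \varphi(x) = q^{|s|-\beta} D^s \phi\big( q[x - n_q(x)] \big)$. Therefore, for any $i \in \{1, \dots, m\}$ and any
$x, x' \in \mathcal{X}_i$, we have

$$|\varphi(x') - \varphi_x(x')| \le L \|x - x'\|^\beta.$$

This implies that for any $\sigma \in \{-1, 1\}^m$ the function $\eta_\sigma$ belongs to the Hölder class
$\Sigma(\beta, L, \mathbb{R}^d)$.

We now check the margin assumption. Set $x_0 = \big( \frac{1}{2q}, \dots, \frac{1}{2q} \big)$. For any $\sigma \in
\{-1, 1\}^m$ we have

$$\begin{aligned}
P_\sigma\big( 0 < |\eta_\sigma(X) - 1/2| \le t \big) &= m\, P_\sigma\big( 0 < \phi[q(X - x_0)] \le 2 t q^\beta \big) \\
&= m \int_{B(x_0, (4q)^{-1})} \mathbb{1}_{\{0 < \phi[q(x - x_0)] \le 2 t q^\beta\}} \frac{w}{\lambda[B(0, (4q)^{-1})]}\,dx \\
&= \frac{mw}{\lambda[B(0, 1/4)]} \int_{B(0, 1/4)} \mathbb{1}_{\{\phi(x) \le 2 t q^\beta\}}\,dx \\
&= mw\, \mathbb{1}_{\{t \ge C_\phi/(2 q^\beta)\}}.
\end{aligned}$$

Therefore, the margin assumption (MA) is satisfied if $mw = O(q^{-\alpha\beta})$.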

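According to Lemma 5.1 in Audibert (2004), for any classifier $\hat f_n$ we have

$$\sup_{P \in \mathcal{H}} \big\{ E R(\hat f_n) - R(f^*) \big\} \ge m w b' \big( 1 - b \sqrt{nw} \big)/2 \tag{6.8}$$

where

$$b \triangleq \Big[ 1 - \Big( \int_{\mathcal{X}_1} \sqrt{1 - \varphi^2(x)}\, \mu_1(x)\,dx \Big)^2 \Big]^{1/2} = C_\phi q^{-\beta}, \qquad
b' \triangleq \int_{\mathcal{X}_1} \varphi(x) \mu_1(x)\,dx = C_\phi q^{-\beta}$$

with $\mu_1(x) = \mu(x) / \int_{\mathcal{X}_1} \mu(z)\,dz$.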

We now prove Theorem 3.5. Take $q = \lfloor C n^{\frac{1}{2\beta+d}} \rfloor$, $w = C' q^{-d}$ and $m = \lfloor C'' q^{d-\alpha\beta} \rfloor$
with some positive constants $C, C', C''$ to be chosen, and set $A_0 = [0, 1]^d \setminus \bigcup_{i=1}^m \mathcal{X}_i$.
The condition $\alpha\beta \le d$ ensures that the above choice of $m$ is not degenerate: we
have $m \ge 1$ for $C''$ large enough. We now prove that $\mathcal{H} \subset \mathcal{P}_\Sigma$ under the appropriate
choice of $C, C', C''$. In fact, select these constants so that the triplet $(q, w, m)$ meets
the conditions $m \le q^d$, $0 < w \le m^{-1}$, $mw = O(q^{-\alpha\beta})$. Then, in view of the
argument preceding (6.8), for any $\sigma \in \{-1, 1\}^m$ the regression function $\eta_\sigma$ belongs
to $\Sigma(\beta, L, \mathbb{R}^d)$ and Assumption (MA) is satisfied. We now check that $P_X$ obeys the
strong density assumption. First, the density $\mu(x)$ equals a positive constant for
$x$ belonging to the union of balls $\bigcup_{i=1}^m B(z_i, (4q)^{-1})$ where $z_i$ is the center of $\mathcal{X}_i$, and
$\mu(x) = (1 - mw)/(1 - m q^{-d}) = 1 + o(1)$, as $n \to \infty$, for $x \in A_0$. Thus, $\mu_{\min} \le
\mu(x) \le \mu_{\max}$ for some positive $\mu_{\min}$ and $\mu_{\max}$. (Note that this construction does not
allow one to choose any prescribed values of $\mu_{\min}$ and $\mu_{\max}$, because $\mu(x) = 1 + o(1)$.
The problem can be fixed via a straightforward but cumbersome modification of the
definition of $A_0$ that we skip here.) Second, the $(c_0, r_0)$-regularity of the support
$A$ of $P_X$ with some $c_0 > 0$ and $r_0 > 0$ follows from the fact that, by construction,
$\lambda(A \cap B(x, r)) = (1 + o(1)) \lambda([0, 1]^d \cap B(x, r))$ for all $x \in A$ and $r > 0$ (here again we
skip the obvious generalization allowing one to get any prescribed $c_0 > 0$). Thus, the
strong density assumption is satisfied, and we conclude that $\mathcal{H} \subset \mathcal{P}_\Sigma$. Theorem 3.5
now follows from (6.8) if we choose $C$ small enough.

Finally, we prove Theorem 4.1. Take $q = \lfloor C n^{\frac{1}{(2+\alpha)\beta+d}} \rfloor$, $w = C' q^{2\beta}/n$ and $m = q^d$
for some constants $C > 0$, $C' > 0$, and choose $A_0$ as a Euclidean ball contained in
$\mathcal{X}_0$. As in the proof of Theorem 3.5, under the appropriate choice of $C$ and $C'$, the
regression function $\eta_\sigma$ belongs to $\Sigma(\beta, L, \mathbb{R}^d)$ and the margin assumption (MA) is
satisfied. Moreover, it is easy to see that the marginal distribution of $X$ obeys the
mild density assumption (the $(c_0, r_0)$-regularity of the support of $P_X$ follows from
considerations analogous to those in the proof of Theorem 3.5). Thus, $\mathcal{H} \subset \mathcal{P}'_\Sigma$.
Choosing $C$ small enough and using (6.8) we obtain Theorem 4.1.
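The way the two parameter choices feed into (6.8) can be followed by tracking powers of $n$ only. The sketch below ignores all constants and integer parts and writes $q = n^a$; the function names are ours. It returns the exponent of $m w b'$ and the exponent of $b\sqrt{nw}$, the latter being zero exactly when $1 - b\sqrt{nw}$ can be kept bounded away from zero by taking $C$ small.

```python
from fractions import Fraction


def exponents(a, w_exp, m_exp, beta):
    """With q = n**a, w = n**w_exp, m = n**m_exp and b = b' = n**(-beta * a),
    return (exponent of m * w * b', exponent of b * sqrt(n * w))."""
    b_exp = -beta * a
    return m_exp + w_exp + b_exp, b_exp + (1 + w_exp) / 2


def theorem_3_5(alpha, beta, d):
    """Choice q = n**(1 / (2 beta + d)), w = q**(-d), m = q**(d - alpha * beta)."""
    alpha, beta, d = Fraction(alpha), Fraction(beta), Fraction(d)
    a = 1 / (2 * beta + d)
    return exponents(a, -d * a, (d - alpha * beta) * a, beta)


def theorem_4_1(alpha, beta, d):
    """Choice q = n**(1 / ((2 + alpha) beta + d)), w = q**(2 beta) / n, m = q**d."""
    alpha, beta, d = Fraction(alpha), Fraction(beta), Fraction(d)
    a = 1 / ((2 + alpha) * beta + d)
    return exponents(a, 2 * beta * a - 1, d * a, beta)


if __name__ == "__main__":
    print(theorem_3_5(1, 2, 3))
    print(theorem_4_1(1, 2, 3))
```

In both cases the first exponent reproduces the rate of the corresponding theorem, namely $-(1+\alpha)\beta/(2\beta+d)$ and $-(1+\alpha)\beta/((2+\alpha)\beta+d)$.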

6.3 Proof of Proposition 3.4

The following lemma describes how the smoothness constraint on the regression
function $\eta$ at some point $x \in \mathbb{R}^d$ implies that $\eta$ "stays close" to $\eta(x)$ in the vicinity
of $x$.

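Lemma 6.1 For any distribution $P \in \mathcal{P}_\Sigma$ with regression function $\eta$ and for any
$\kappa > 0$, there exist $L' > 0$ and $t_0 > 0$ such that for any $x$ in the support of $P_X$ and
$0 < t \le t_0$, we have

$$P_X\Big[ |\eta(X) - \eta(x)| \le t;\ X \in B\big( x, \kappa t^{\frac{1}{1 \wedge \beta}} \big) \Big] \ge L' t^{\frac{d}{1 \wedge \beta}}.$$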



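Proof of Lemma 6.1. Let $A$ denote the support of $P_X$. Let us first consider
the case $\beta \le 1$. Then for any $x, x' \in \mathbb{R}^d$, we have $|\eta(x') - \eta(x)| \le L \|x' - x\|^\beta$. Let
$\kappa' = \kappa \wedge L^{-1/\beta}$. For any $0 < t \le L r_0^\beta$, we get

$$\begin{aligned}
P_X\Big[ |\eta(X) - \eta(x)| \le t;\ X \in B\big( x, \kappa t^{\frac1\beta} \big) \Big]
&= P_X\Big[ |\eta(X) - \eta(x)| \le t;\ X \in B\big( x, \kappa t^{\frac1\beta} \big) \cap A \Big] \\
&\ge P_X\Big[ X \in B\big( x, \kappa t^{\frac1\beta} \wedge (t/L)^{\frac1\beta} \big) \cap A \Big] \\
&= P_X\Big[ B\big( x, \kappa' t^{\frac1\beta} \big) \cap A \Big] \\
&\ge c_0 \mu_{\min} v_d (\kappa')^d t^{\frac{d}{\beta}},
\end{aligned}$$

which is the desired result with $L' \triangleq c_0 \mu_{\min} v_d (\kappa')^d$ and $t_0 \triangleq L r_0^\beta$.

For the case $\beta > 1$, by assumption, $\eta$ is continuously differentiable. Let $C(A)$ be
the convex hull of the support $A$ of $P_X$. By compactness of $C(A)$, there exists $C > 0$
such that for any $s \in \mathbb{N}^d$ with $|s| = 1$,

$$\sup_{x \in C(A)} |D^s \eta(x)| \le C.$$

So we have for any $x, x' \in A$,

$$|\eta(x) - \eta(x')| \le C \|x - x'\|.$$

The rest of the proof is then similar to the one of the first case. ■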

• We will now prove the first item of Proposition 3.4. Let $P \in \mathcal{P}_\Sigma$ be such that
the regression function associated with $P$ hits 1/2 at $x_0 \in \mathring{A}$, where $\mathring{A}$ denotes
the interior of the support of $P_X$. Let $r > 0$ be such that $B(x_0, r) \subset A$. Let
$x \in B(x_0, r)$ be such that $\eta(x) \ne \frac12$. Let $t_1 = |\eta(x) - 1/2|$. For any $0 < t \le t_1$, let
$x_t \in [x_0; x]$ be such that $|\eta(x_t) - 1/2| = t/2$. We have $x_t \in A$, so that we can apply
Lemma 6.1 (with $\kappa = 1$ for instance) and obtain for any $0 < t \le t_1 \wedge (4 t_0)$

$$P_X\big[ 0 < |\eta(X) - 1/2| \le t \big] \ge P_X\big[ |\eta(X) - \eta(x_t)| \le t/4 \big] \ge L' (t/4)^{\frac{d}{1 \wedge \beta}}.$$

Now from the margin assumption, we get that for any small enough $t > 0$,
$C_0 t^\alpha \ge L' (t/4)^{\frac{d}{1 \wedge \beta}}$, hence $\alpha \le \frac{d}{1 \wedge \beta}$.

• For the second item of Proposition 3.4, to skip cumbersome details, we may
assume that $\mathcal{C}$ contains the unit ball in $\mathbb{R}^d$. Consider the distribution such
that

– $P_X$ is the uniform measure on
$\{(x_1, \dots, x_d) \in \mathbb{R}^d : |x_1 - 1/4| + |x_2| + \cdots + |x_d| \le 1/4\}$,

the regression function associated with P is 

1 + C,,sign(a;i)|a;i|^^"'"u(a;i) 



r]{xi, ...,Xd) 
where 



exp ( - j^) if \t\ < 1 
otherwise, 



u{t) = 

and < C,, < 1 is small enough so that for any x, x' G M'^, rj satisfies 

\'ri{x') — "/^xl^;')! — -^11^ ~ ^'11^- 

For appropriate positive parameters Co,ro,/Uinax > /Umin > 0, the only non- 
trivial task in checking that P belongs to Vy, is to check the margin assumption. 
For t small enough, we have 



P 



X 



r]{X)-l/2\<t <Px |X/^i<Ct;|Xi-l/4| + |X2| + --- + |Xrf| <l/4 



< h(X) - 1/2| < t 



d 



< Ct—. So 

d 



for some C > 0. Therefore, we have Px 

the margin assumption is satisfied for an appropriate Cq whenever a < -0^. 
Since $\eta$ hits 1/2 at $0_{\mathbb{R}^d}$, which is in the boundary of the support of $P_X$, we have
proved the second assertion.

• For the third assertion of Proposition 3.4, to avoid cumbersome details again,
we may assume that $\mathcal{C}$ contains the unit ball in $\mathbb{R}^d$. Consider the distribution
such that

– $P_X$ is the uniform measure on the unit ball,

– the regression function associated with $P$ is

$$\eta(x) = \frac{1 + C_\eta \|x\|^2 u(\|x\|^2/2)}{2},$$

where $0 < C_\eta \le 1$ is small enough so that for any $x, x' \in \mathbb{R}^d$, $\eta$ satisfies

$$|\eta(x') - \eta_x(x')| \le L \|x - x'\|^\beta.$$

For appropriate positive parameters $C_0$, $c_0$, $r_0$, $\mu_{\max} > \mu_{\min} > 0$, the
distribution $P$ belongs to $\mathcal{P}_\Sigma$ provided that $\alpha \le d/2$ (in order that the margin
assumption holds). We have obtained the desired result since $\eta$ hits 1/2 at $0_{\mathbb{R}^d}$,
which is in the interior of the support of $P_X$.

• For the last item of Proposition 3.4, let $P \in \mathcal{P}_\Sigma$ be such that the regression
function $\eta$ associated with $P$ crosses 1/2 at $x_0 \in \mathring{A}$. For $d = 1$, from the first
item of the theorem, we necessarily have $\alpha(\beta \wedge 1) \le 1$. Let us now consider
the case $d > 1$.

Figure 1 will help to keep track of the following notation.

[Figure 1: Notation summary]

Let $r_1 > 0$ be such that $B(x_0, 3 r_1) \subset A$. Introduce $x_-$ and $x_+$ in $B(x_0, r_1)$ such
that $\eta(x_-) < 1/2$ and $\eta(x_+) > 1/2$. Let $t_1 = \big( 1/2 - \eta(x_-) \big) \wedge \big( \eta(x_+) - 1/2 \big)$.
Define y = ^-~^^+ ^ = ||^+~|^~|| and D = — a;_||. Let ei, . . . ,erf„i be unit 
vectors such that ei, . . . , is an orthonormal basis of M'^. Let B*{x, r) (resp. 
5*(x,r)) denote the ball (resp. the sphere) centered at x and of radius r wrt 
the norm = supi<j<^ ei)|. 

Since $\eta$ is continuous, there exists $r_2 > 0$ such that

$$\eta(x) < 1/2 - t_1/2 \ \text{ for any } x \in B^*(x_-, r_2), \qquad
\eta(x) > 1/2 + t_1/2 \ \text{ for any } x \in B^*(x_+, r_2).$$
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Let ( = For any k = (fci, . . . , k^-i) E ^, introduce 

d-\ 

1=1 

For any t in ]0; consider the grid G = {y/,.; k E Z'^"^, max \ki\ < ^-^=^ j. 
For any yk in G, we have \\yk — y\\ < — 1 max |t^A;j| < < ri. There- 

l<i<d— 1 

fore, using that y E B{xo,ri), the grid G is included in B{xo,2ri). For 
any ?/fc G G, let = [x_;?/a:] n S*{x-,r2) and = [x+;yk] H 5*(x+,r2). 
Since - y\\ < D/2, we have y^ = x_ + r2ed + YltZl ^iCi and 

For any y^ in G, consider the continuous path formed by the segments [?/^; yk] 
and [y^; y'^]. Since r/ is continuous on this path, there exists Wk E 'jk — [z/^; yk]^ 
[yk, y^] such that ri{wk) = 1/2 + 1/2. Now let us show that when k ^ k', Wk 
and Wk' are at least away from each other. The distance between Wk 

and Wk> is not less than the distance between the paths jk and 7^'. Let U 
denote the biggest integer smaller than or equal to 2y/d~it'- ' ^^^^ yk yk' in 
G, the distance between 7^ and jk' is minimum for k = K = (U, . . . ,U) and 
k' = K' = (U — 1,U, . . . ,U). This distance is equal to the distance between 
and its orthogonal projection on [yj^,;yK'], which is the distance between 
y]^ and the line yx')- Let K" = {0,U, . . . ,U) E U^^^ . To compute this 
distance V , it suffices to look at the plane {x-\ yx"] Vk) (see figure I^J- 




X 



Figure 2: plane {x_;yK";yK) 



We obtain that the angle 6 between yx' — x^ and yx" — is smaller than 
7r/4. As a consequence, V = \\yj^- — yj^,\\ cos9 > ^/2r2t'' /D. 

Finally, focusing on the behaviour of the regression function near the w^s, by 
using Lemma with k, = we obtain that there exists L' > and to > 
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\^(X)-r^iw,)\<t/A;XeB{wk,^) 



such that for any < t < 4to A ti, 
Cot" > Px[0<\v{X)-'^\<t 

> E Px 

fceZ^*-!: max |fci|< 

> (2f/+ l)'^^iL'(t/4)'='^ 

> Ct^, 

hence $\alpha \le \zeta$ (which is the desired result).

For the converse, the proof is similar to the ones of the second and third
assertions of the proposition. Without loss of generality, we may assume that
$S = \{(x_1, \dots, x_d) \in \mathbb{R}^d : \max_{1 \le i \le d} |x_i| \le 1/2\}$ is a subset of $\mathcal{C}$. We consider the
distribution $P$ such that

– $P_X$ is the uniform measure on $S$,

– the regression function associated with $P$ is

$$\eta(x_1, \dots, x_d) = \frac{1 + C_\eta\, \mathrm{sign}(x_1) |x_1|^{\beta \wedge 1}}{2},$$

where $0 < C_\eta \le 1$ is small enough so that for any $x, x' \in \mathbb{R}^d$, $\eta$ satisfies

$$|\eta(x') - \eta_x(x')| \le L \|x - x'\|^\beta.$$

For small enough $t > 0$, we have

$$P_X\big[ 0 < |\eta(X) - 1/2| \le t \big] \le P_X\big[ |X_1|^{\beta \wedge 1} \le C t \big],$$

for some constant $C > 0$, so that we have $P_X\big[ 0 < |\eta(X) - \frac12| \le t \big] \le 2 (C t)^{\frac{1}{\beta \wedge 1}}$.
As a consequence, for appropriate parameters $C_0$, $c_0$, $r_0$, $\mu_{\max} > \mu_{\min} > 0$, the
distribution $P$ belongs to $\mathcal{P}_\Sigma$ whenever $\alpha \le \frac{1}{\beta \wedge 1}$. Since $\eta$ crosses 1/2 at $0_{\mathbb{R}^d}$,
which is in the interior of the support of $P_X$, the converse holds.



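6.4 Proof of Theorem 4.2

We prove the theorem for $p < \infty$. The proof for $p = \infty$ is analogous. For any
decision rule $f$ we set $d(f) \triangleq R(f) - R(f^*)$ and

$$f^{**}(x, f) \triangleq \begin{cases} f^*(x) & \text{if } \eta(x) \ne 1/2, \\ f(x) & \text{if } \eta(x) = 1/2, \end{cases} \qquad \forall\, x \in \mathbb{R}^d.$$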
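Lemma 6.2 Under Assumption (MA) for any decision rule $f$ we have

$$P_X\big( f(X) \ne f^{**}(X, f) \big) \le C\, d(f)^{\frac{\alpha}{1+\alpha}}. \tag{6.9}$$

Proof. Note that $f^{**}(\cdot, f)$ is a Bayes rule, and following the same lines as in
Proposition 1 of Tsybakov (2004a) we get $P_X\big( f(X) \ne f^{**}(X, f),\ \eta(X) \ne 1/2 \big) \le
C\, d(f)^{\alpha/(1+\alpha)}$. It remains to observe that $P_X\big( f(X) \ne f^{**}(X, f),\ \eta(X) \ne 1/2 \big) =
P_X\big( f(X) \ne f^{**}(X, f) \big)$. ■

For a Borel function $\bar\eta$ on $\mathbb{R}^d$ define $f_{\bar\eta} \triangleq \mathbb{1}_{\{\bar\eta \ge 1/2\}}$, $f^*_{\bar\eta}(\cdot) \triangleq f^{**}(\cdot, f_{\bar\eta})$ and

Let $\eta_n$ be an element of $\mathcal{N}_{\varepsilon_n}$ such that $\|\eta_n - \eta\|_{p,\lambda} \le \varepsilon_n$, where $\|\cdot\|_{p,\lambda}$ is the
$L_p(\mathcal{C}, \lambda)$ norm. In view of the assumption on $\mathcal{P}$ we have $\|\eta_n - \eta\|_p \le \mu_{\max}^{1/p} \varepsilon_n$ where
$\|\cdot\|_p$ is the $L_p(\mathbb{R}^d, P_X)$ norm. It follows from the comparison inequality (5.3) that
$d(f_{\eta_n}) \le C \varepsilon_n^{\frac{(1+\alpha)p}{p+\alpha}} \triangleq \delta_n$. Set

$$\Delta_n = C n^{-\frac{(1+\alpha)p}{(2+\alpha)p+\rho(p+\alpha)}}$$

(i.e., $\Delta_n$ is of the order of the desired rate). Fix $t > 0$ and introduce the set

$$\mathcal{N}^*_t \triangleq \big\{ \bar\eta \in \mathcal{N}_{\varepsilon_n} : d(f_{\bar\eta}) \ge t \Delta_n \big\}.$$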

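For any $t > 0$ we have

$$\begin{aligned}
P\big( d(\hat f^S_n) \ge t \Delta_n \big)
&\le P\Big( \min_{\bar\eta \in \mathcal{N}^*_t} \big[ R_n(f_{\bar\eta}) - R_n(f_{\eta_n}) \big] \le 0 \Big) \\
&= P\Big( \min_{\bar\eta \in \mathcal{N}^*_t} \big[ Z_n(f_{\bar\eta}) - Z_n(f_{\eta_n}) + d(f_{\bar\eta}) - d(f_{\eta_n}) \big] \le 0 \Big) \\
&\le P\Big( \min_{\bar\eta \in \mathcal{N}^*_t} \big[ Z_n(f_{\bar\eta}) - Z_n(f_{\eta_n}) + d(f_{\bar\eta})/2 + t \Delta_n/2 - d(f_{\eta_n}) \big] \le 0 \Big) \\
&\le P\Big( \min_{\bar\eta \in \mathcal{N}^*_t} \big[ Z_n(f_{\bar\eta}) + d(f_{\bar\eta})/2 \big] \le 0 \Big) + P\big( Z_n(f_{\eta_n}) \ge t \Delta_n/2 - d(f_{\eta_n}) \big) \\
&\le P\Big( \min_{\bar\eta \in \mathcal{N}^*_t} \big[ Z_n(f_{\bar\eta}) + d(f_{\bar\eta})/2 \big] \le 0 \Big) + P\big( Z_n(f_{\eta_n}) \ge t \Delta_n/2 - \delta_n \big).
\end{aligned}$$

Since $\Delta_n$ is of the same order as $\delta_n$ we can choose $t$ large enough to have $t \Delta_n/2 - \delta_n \ge
t \Delta_n/4$. Thus,

$$\begin{aligned}
P\big( d(\hat f^S_n) \ge t \Delta_n \big)
&\le \operatorname{card} \mathcal{N}^*_t \max_{\bar\eta \in \mathcal{N}^*_t} P\big( Z_n(f_{\bar\eta}) \le -d(f_{\bar\eta})/2 \big) + P\big( Z_n(f_{\eta_n}) \ge t \Delta_n/4 \big) \\
&\le \exp\big( A' \varepsilon_n^{-\rho} \big) \max_{\bar\eta \in \mathcal{N}^*_t} P\big( Z_n(f_{\bar\eta}) \le -d(f_{\bar\eta})/2 \big) + P\big( Z_n(f_{\eta_n}) \ge t \Delta_n/4 \big).
\end{aligned}$$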
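Note that for any decision rule $f$ the value $Z_n(f)$ is an average of $n$ i.i.d. bounded and
centered random variables whose variance does not exceed $P_X\big( f(X) \ne f^{**}(X, f) \big)$.
Thus, using Bernstein's inequality and (6.9) we obtain

$$P\big( -Z_n(f) \ge a \big) \le \exp\Big( -\frac{C n a^2}{a + d(f)^{\alpha/(1+\alpha)}} \Big), \qquad \forall\, a > 0.$$

Therefore, for $\bar\eta \in \mathcal{N}^*_t$,

$$P\big( Z_n(f_{\bar\eta}) \le -d(f_{\bar\eta})/2 \big) \le \exp\big( -C n\, d(f_{\bar\eta})^{(2+\alpha)/(1+\alpha)} \big) \le \exp\big( -C n (t \Delta_n)^{(2+\alpha)/(1+\alpha)} \big).$$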

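Similarly, for $t > C$,

$$P\big( Z_n(f_{\eta_n}) \ge t \Delta_n/4 \big) \le \exp\Big( -\frac{C n (t \Delta_n)^2}{t \Delta_n + \delta_n^{\alpha/(1+\alpha)}} \Big) \le \exp\Big( -\frac{C n \Delta_n^2}{\Delta_n + \Delta_n^{\alpha/(1+\alpha)}} \Big) \le \exp\big( -C n \Delta_n^{(2+\alpha)/(1+\alpha)} \big).$$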
The result of the theorem follows now from the above inequalities and the relation 